feat: Implement async queue-based news pipeline with microservices

Major architectural transformation from synchronous to asynchronous processing:

## Pipeline Services (8 microservices)
- pipeline-scheduler: APScheduler for 30-minute periodic job triggers
- pipeline-rss-collector: RSS feed collection with deduplication (7-day TTL)
- pipeline-google-search: Content enrichment via Google Search API
- pipeline-ai-summarizer: AI summarization using Claude API (claude-sonnet-4-20250514)
- pipeline-translator: Translation using DeepL Pro API
- pipeline-image-generator: Image generation with Replicate API (Stable Diffusion)
- pipeline-article-assembly: Final article assembly and MongoDB storage
- pipeline-monitor: Real-time monitoring dashboard (port 8100)

## Key Features
- Redis-based job queue with deduplication
- Asynchronous processing with Python asyncio
- Shared models and queue manager for inter-service communication
- Docker containerization for all services
- Container names standardized with site11_ prefix

## Removed Services
- Moved to backup: google-search, rss-feed, news-aggregator, ai-writer

## Configuration
- DeepL Pro API: 3abbc796-2515-44a8-972d-22dcf27ab54a
- Claude Model: claude-sonnet-4-20250514
- Redis Queue TTL: 7 days for deduplication

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
jungwoo choi
2025-09-13 19:22:14 +09:00
parent 1d90af7c3c
commit 070032006e
73 changed files with 5922 additions and 4 deletions

View File

@ -0,0 +1,90 @@
# Pipeline Makefile
.PHONY: help build up down restart logs clean test monitor
help:
@echo "Pipeline Management Commands:"
@echo " make build - Build all Docker images"
@echo " make up - Start all services"
@echo " make down - Stop all services"
@echo " make restart - Restart all services"
@echo " make logs - View logs for all services"
@echo " make clean - Clean up containers and volumes"
@echo " make monitor - Open monitor dashboard"
@echo " make test - Test pipeline with sample keyword"
build:
docker-compose build
up:
docker-compose up -d
down:
docker-compose down
restart:
docker-compose restart
logs:
docker-compose logs -f
clean:
docker-compose down -v
docker system prune -f
monitor:
@echo "Opening monitor dashboard..."
@echo "Dashboard: http://localhost:8100"
@echo "API Docs: http://localhost:8100/docs"
test:
@echo "Testing pipeline with sample keyword..."
curl -X POST http://localhost:8100/api/keywords \
-H "Content-Type: application/json" \
-d '{"keyword": "테스트", "schedule": "30min"}'
@echo "\nTriggering immediate processing..."
curl -X POST http://localhost:8100/api/trigger/테스트
# Service-specific commands
scheduler-logs:
docker-compose logs -f scheduler
rss-logs:
docker-compose logs -f rss-collector
search-logs:
docker-compose logs -f google-search
summarizer-logs:
docker-compose logs -f ai-summarizer
assembly-logs:
docker-compose logs -f article-assembly
monitor-logs:
docker-compose logs -f monitor
# Database commands
redis-cli:
docker-compose exec redis redis-cli
mongo-shell:
docker-compose exec mongodb mongosh -u admin -p password123
# Queue management
queue-status:
@echo "Checking queue status..."
docker-compose exec redis redis-cli --raw LLEN queue:keyword
docker-compose exec redis redis-cli --raw LLEN queue:rss
docker-compose exec redis redis-cli --raw LLEN queue:search
docker-compose exec redis redis-cli --raw LLEN queue:summarize
docker-compose exec redis redis-cli --raw LLEN queue:assembly
queue-clear:
@echo "Clearing all queues..."
docker-compose exec redis redis-cli FLUSHDB
# Health check
health:
@echo "Checking service health..."
curl -s http://localhost:8100/api/health | python3 -m json.tool

154
services/pipeline/README.md Normal file
View File

@ -0,0 +1,154 @@
# News Pipeline System
비동기 큐 기반 뉴스 생성 파이프라인 시스템
## 아키텍처
```
Scheduler → RSS Collector → Google Search → AI Summarizer → Article Assembly → MongoDB
↓ ↓ ↓ ↓ ↓
Redis Queue Redis Queue Redis Queue Redis Queue Redis Queue
```
## 서비스 구성
### 1. Scheduler
- 30분마다 등록된 키워드 처리
- 오전 7시, 낮 12시, 저녁 6시 우선 처리
- MongoDB에서 키워드 로드 후 큐에 작업 생성
### 2. RSS Collector
- RSS 피드 수집 (Google News RSS)
- 7일간 중복 방지 (Redis Set)
- 키워드 관련성 필터링
### 3. Google Search
- RSS 아이템별 추가 검색 결과 수집
- 아이템당 최대 3개 결과
- 작업당 최대 5개 아이템 처리
### 4. AI Summarizer
- Claude Haiku로 빠른 요약 생성
- 200자 이내 한국어 요약
- 병렬 처리 지원 (3 workers)
### 5. Article Assembly
- Claude Sonnet으로 종합 기사 작성
- 1500자 이내 전문 기사
- MongoDB 저장 및 통계 업데이트
### 6. Monitor
- 실시간 파이프라인 모니터링
- 큐 상태, 워커 상태 확인
- REST API 제공 (포트 8100)
## 시작하기
### 1. 환경 변수 설정
```bash
# .env 파일 확인
CLAUDE_API_KEY=your_claude_api_key
GOOGLE_API_KEY=your_google_api_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
### 2. 서비스 시작
```bash
cd pipeline
docker-compose up -d
```
### 3. 모니터링
```bash
# 로그 확인
docker-compose logs -f
# 특정 서비스 로그
docker-compose logs -f scheduler
# 모니터 API
curl http://localhost:8100/api/stats
```
## API 엔드포인트
### Monitor API (포트 8100)
- `GET /api/stats` - 전체 통계
- `GET /api/queues/{queue_name}` - 큐 상세 정보
- `GET /api/keywords` - 키워드 목록
- `POST /api/keywords` - 키워드 등록
- `DELETE /api/keywords/{id}` - 키워드 삭제
- `GET /api/articles` - 기사 목록
- `GET /api/articles/{id}` - 기사 상세
- `GET /api/workers` - 워커 상태
- `POST /api/trigger/{keyword}` - 수동 처리 트리거
- `GET /api/health` - 헬스 체크
## 키워드 등록 예시
```bash
# 새 키워드 등록
curl -X POST http://localhost:8100/api/keywords \
-H "Content-Type: application/json" \
-d '{"keyword": "인공지능", "schedule": "30min"}'
# 수동 처리 트리거
curl -X POST http://localhost:8100/api/trigger/인공지능
```
## 데이터베이스
### MongoDB Collections
- `keywords` - 등록된 키워드
- `articles` - 생성된 기사
- `keyword_stats` - 키워드별 통계
### Redis Keys
- `queue:*` - 작업 큐
- `processing:*` - 처리 중 작업
- `failed:*` - 실패한 작업
- `dedup:rss:*` - RSS 중복 방지
- `workers:*:active` - 활성 워커
## 트러블슈팅
### 큐 초기화
```bash
docker-compose exec redis redis-cli FLUSHDB
```
### 워커 재시작
```bash
docker-compose restart rss-collector
```
### 데이터베이스 접속
```bash
# MongoDB
docker-compose exec mongodb mongosh -u admin -p password123
# Redis
docker-compose exec redis redis-cli
```
## 스케일링
워커 수 조정:
```yaml
# docker-compose.yml
ai-summarizer:
deploy:
replicas: 5 # 워커 수 증가
```
## 모니터링 대시보드
브라우저에서 http://localhost:8100 접속하여 파이프라인 상태 확인
## 로그 레벨 설정
`.env` 파일에서 조정:
```
LOG_LEVEL=DEBUG # INFO, WARNING, ERROR
```

View File

@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# 의존성 설치
COPY ./ai-summarizer/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 공통 모듈 복사
COPY ./shared /app/shared
# AI Summarizer 코드 복사
COPY ./ai-summarizer /app
# 환경변수
ENV PYTHONUNBUFFERED=1
# 실행
CMD ["python", "ai_summarizer.py"]

View File

@ -0,0 +1,161 @@
"""
AI Summarizer Service
Claude API를 사용한 뉴스 요약 서비스
"""
import asyncio
import logging
import os
import sys
from typing import List, Dict, Any
from anthropic import AsyncAnthropic
# Import from shared module
from shared.models import PipelineJob, EnrichedItem, SummarizedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class AISummarizerWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.claude_api_key = os.getenv("CLAUDE_API_KEY")
self.claude_client = None
async def start(self):
"""워커 시작"""
logger.info("Starting AI Summarizer Worker")
# Redis 연결
await self.queue_manager.connect()
# Claude 클라이언트 초기화
if self.claude_api_key:
self.claude_client = AsyncAnthropic(api_key=self.claude_api_key)
else:
logger.error("Claude API key not configured")
return
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기
job = await self.queue_manager.dequeue('ai_summarization', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""AI 요약 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for AI summarization")
enriched_items = job.data.get('enriched_items', [])
summarized_items = []
for item_data in enriched_items:
enriched_item = EnrichedItem(**item_data)
# AI 요약 생성
summary = await self._generate_summary(enriched_item)
summarized_item = SummarizedItem(
enriched_item=enriched_item,
ai_summary=summary,
summary_language='ko'
)
summarized_items.append(summarized_item)
# API 속도 제한
await asyncio.sleep(1)
if summarized_items:
logger.info(f"Summarized {len(summarized_items)} items")
# 다음 단계로 전달 (번역 단계로)
job.data['summarized_items'] = [item.dict() for item in summarized_items]
job.stages_completed.append('ai_summarization')
job.stage = 'translation'
await self.queue_manager.enqueue('translation', job)
await self.queue_manager.mark_completed('ai_summarization', job.job_id)
else:
logger.warning(f"No items summarized for job {job.job_id}")
await self.queue_manager.mark_failed(
'ai_summarization',
job,
"No items to summarize"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('ai_summarization', job, str(e))
async def _generate_summary(self, enriched_item: EnrichedItem) -> str:
"""Claude를 사용한 요약 생성"""
try:
# 컨텐츠 준비
content_parts = [
f"제목: {enriched_item.rss_item.title}",
f"요약: {enriched_item.rss_item.summary or '없음'}"
]
# 검색 결과 추가
if enriched_item.search_results:
content_parts.append("\n관련 검색 결과:")
for idx, result in enumerate(enriched_item.search_results[:3], 1):
content_parts.append(f"{idx}. {result.title}")
if result.snippet:
content_parts.append(f" {result.snippet}")
content = "\n".join(content_parts)
# Claude API 호출
prompt = f"""다음 뉴스 내용을 200자 이내로 핵심만 요약해주세요.
중요한 사실, 수치, 인물, 조직을 포함하고 객관적인 톤을 유지하세요.
{content}
요약:"""
response = await self.claude_client.messages.create(
model="claude-sonnet-4-20250514", # 최신 Sonnet 모델
max_tokens=500,
temperature=0.3,
messages=[
{"role": "user", "content": prompt}
]
)
summary = response.content[0].text.strip()
return summary
except Exception as e:
logger.error(f"Error generating summary: {e}")
# 폴백: 원본 요약 사용
return enriched_item.rss_item.summary[:200] if enriched_item.rss_item.summary else enriched_item.rss_item.title
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("AI Summarizer Worker stopped")
async def main():
"""메인 함수"""
worker = AISummarizerWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,3 @@
anthropic==0.50.0
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# 의존성 설치
COPY ./article-assembly/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 공통 모듈 복사
COPY ./shared /app/shared
# Article Assembly 코드 복사
COPY ./article-assembly /app
# 환경변수
ENV PYTHONUNBUFFERED=1
# 실행
CMD ["python", "article_assembly.py"]

View File

@ -0,0 +1,234 @@
"""
Article Assembly Service
최종 기사 조립 및 MongoDB 저장 서비스
"""
import asyncio
import logging
import os
import sys
import json
from datetime import datetime
from typing import List, Dict, Any
from anthropic import AsyncAnthropic
from motor.motor_asyncio import AsyncIOMotorClient
# Import from shared module
from shared.models import PipelineJob, SummarizedItem, FinalArticle
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ArticleAssemblyWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.claude_api_key = os.getenv("CLAUDE_API_KEY")
self.claude_client = None
self.mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
self.db_name = os.getenv("DB_NAME", "pipeline_db")
self.db = None
async def start(self):
"""워커 시작"""
logger.info("Starting Article Assembly Worker")
# Redis 연결
await self.queue_manager.connect()
# MongoDB 연결
client = AsyncIOMotorClient(self.mongodb_url)
self.db = client[self.db_name]
# Claude 클라이언트 초기화
if self.claude_api_key:
self.claude_client = AsyncAnthropic(api_key=self.claude_api_key)
else:
logger.error("Claude API key not configured")
return
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기
job = await self.queue_manager.dequeue('article_assembly', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""최종 기사 조립 작업 처리"""
try:
start_time = datetime.now()
logger.info(f"Processing job {job.job_id} for article assembly")
summarized_items = job.data.get('summarized_items', [])
if not summarized_items:
logger.warning(f"No items to assemble for job {job.job_id}")
await self.queue_manager.mark_failed(
'article_assembly',
job,
"No items to assemble"
)
return
# 최종 기사 생성
article = await self._generate_final_article(job, summarized_items)
# 처리 시간 계산
processing_time = (datetime.now() - start_time).total_seconds()
article.processing_time = processing_time
# MongoDB에 저장
await self.db.articles.insert_one(article.dict())
logger.info(f"Article {article.article_id} saved to MongoDB")
# 완료 표시
job.stages_completed.append('article_assembly')
await self.queue_manager.mark_completed('article_assembly', job.job_id)
# 통계 업데이트
await self._update_statistics(job.keyword_id)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('article_assembly', job, str(e))
async def _generate_final_article(
self,
job: PipelineJob,
summarized_items: List[Dict]
) -> FinalArticle:
"""Claude를 사용한 최종 기사 생성"""
# 아이템 정보 준비
items_text = []
for idx, item_data in enumerate(summarized_items, 1):
item = SummarizedItem(**item_data)
items_text.append(f"""
[뉴스 {idx}]
제목: {item.enriched_item['rss_item']['title']}
요약: {item.ai_summary}
출처: {item.enriched_item['rss_item']['link']}
""")
content = "\n".join(items_text)
# Claude로 종합 기사 작성
prompt = f"""다음 뉴스 항목들을 바탕으로 종합적인 기사를 작성해주세요.
키워드: {job.keyword}
뉴스 항목들:
{content}
다음 JSON 형식으로 작성해주세요:
{{
"title": "종합 기사 제목",
"content": "기사 본문 (1500자 이내, 문단 구분)",
"summary": "한 줄 요약 (100자 이내)",
"categories": ["카테고리1", "카테고리2"],
"tags": ["태그1", "태그2", "태그3"]
}}
요구사항:
- 전문적이고 객관적인 톤
- 핵심 정보와 트렌드 파악
- 시사점 포함
- 한국 독자 대상"""
try:
response = await self.claude_client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=3000,
temperature=0.7,
messages=[
{"role": "user", "content": prompt}
]
)
# JSON 파싱
content_text = response.content[0].text
json_start = content_text.find('{')
json_end = content_text.rfind('}') + 1
if json_start != -1 and json_end > json_start:
article_data = json.loads(content_text[json_start:json_end])
else:
raise ValueError("No valid JSON in response")
# FinalArticle 생성
article = FinalArticle(
job_id=job.job_id,
keyword_id=job.keyword_id,
keyword=job.keyword,
title=article_data.get('title', f"{job.keyword} 종합 뉴스"),
content=article_data.get('content', ''),
summary=article_data.get('summary', ''),
source_items=[], # 간소화
images=[], # 이미지는 별도 서비스에서 처리
categories=article_data.get('categories', []),
tags=article_data.get('tags', []),
pipeline_stages=job.stages_completed,
processing_time=0 # 나중에 업데이트
)
return article
except Exception as e:
logger.error(f"Error generating article: {e}")
# 폴백 기사 생성
return FinalArticle(
job_id=job.job_id,
keyword_id=job.keyword_id,
keyword=job.keyword,
title=f"{job.keyword} 뉴스 요약 - {datetime.now().strftime('%Y-%m-%d')}",
content=content,
summary=f"{job.keyword} 관련 {len(summarized_items)}개 뉴스 요약",
source_items=[],
images=[],
categories=['자동생성'],
tags=[job.keyword],
pipeline_stages=job.stages_completed,
processing_time=0
)
async def _update_statistics(self, keyword_id: str):
"""키워드별 통계 업데이트"""
try:
await self.db.keyword_stats.update_one(
{"keyword_id": keyword_id},
{
"$inc": {"articles_generated": 1},
"$set": {"last_generated": datetime.now()}
},
upsert=True
)
except Exception as e:
logger.error(f"Error updating statistics: {e}")
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Article Assembly Worker stopped")
async def main():
"""메인 함수"""
worker = ArticleAssemblyWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,5 @@
anthropic==0.50.0
motor==3.1.1
pymongo==4.3.3
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,62 @@
#!/usr/bin/env python3
"""Fix import statements in all pipeline services"""
import os
import re
def fix_imports(filepath):
"""Fix import statements in a Python file"""
with open(filepath, 'r') as f:
content = f.read()
# Pattern to match the old import style
old_pattern = r"# 상위 디렉토리의 shared 모듈 import\nsys\.path\.append\(os\.path\.join\(os\.path\.dirname\(__file__\), '\.\.', 'shared'\)\)\nfrom ([\w, ]+) import ([\w, ]+)"
# Replace with new import style
def replace_imports(match):
modules = match.group(1)
items = match.group(2)
# Build new import statements
imports = []
if 'models' in modules:
imports.append(f"from shared.models import {items}" if 'models' in modules else "")
if 'queue_manager' in modules:
imports.append(f"from shared.queue_manager import QueueManager")
return "# Import from shared module\n" + "\n".join(filter(None, imports))
# Apply the replacement
new_content = re.sub(old_pattern, replace_imports, content)
# Also handle simpler patterns
new_content = new_content.replace(
"sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'shared'))\nfrom models import",
"from shared.models import"
)
new_content = new_content.replace(
"\nfrom queue_manager import",
"\nfrom shared.queue_manager import"
)
# Write back if changed
if new_content != content:
with open(filepath, 'w') as f:
f.write(new_content)
print(f"Fixed imports in {filepath}")
return True
return False
# Files to fix
files_to_fix = [
"monitor/monitor.py",
"google-search/google_search.py",
"article-assembly/article_assembly.py",
"rss-collector/rss_collector.py",
"ai-summarizer/ai_summarizer.py"
]
for file_path in files_to_fix:
full_path = os.path.join(os.path.dirname(__file__), file_path)
if os.path.exists(full_path):
fix_imports(full_path)

View File

@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# 의존성 설치
COPY ./google-search/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 공통 모듈 복사
COPY ./shared /app/shared
# Google Search 코드 복사
COPY ./google-search /app
# 환경변수
ENV PYTHONUNBUFFERED=1
# 실행
CMD ["python", "google_search.py"]

View File

@ -0,0 +1,153 @@
"""
Google Search Service
Google 검색으로 RSS 항목 강화
"""
import asyncio
import logging
import os
import sys
import json
from typing import List, Dict, Any
import aiohttp
from datetime import datetime
# Import from shared module
from shared.models import PipelineJob, RSSItem, SearchResult, EnrichedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class GoogleSearchWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.google_api_key = os.getenv("GOOGLE_API_KEY")
self.search_engine_id = os.getenv("GOOGLE_SEARCH_ENGINE_ID")
self.max_results_per_item = 3
async def start(self):
"""워커 시작"""
logger.info("Starting Google Search Worker")
# Redis 연결
await self.queue_manager.connect()
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기
job = await self.queue_manager.dequeue('search_enrichment', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""검색 강화 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for search enrichment")
rss_items = job.data.get('rss_items', [])
enriched_items = []
# 최대 5개 항목만 처리 (API 할당량 관리)
for item_data in rss_items[:5]:
rss_item = RSSItem(**item_data)
# 제목으로 Google 검색
search_results = await self._search_google(rss_item.title)
enriched_item = EnrichedItem(
rss_item=rss_item,
search_results=search_results
)
enriched_items.append(enriched_item)
# API 속도 제한
await asyncio.sleep(0.5)
if enriched_items:
logger.info(f"Enriched {len(enriched_items)} items with search results")
# 다음 단계로 전달
job.data['enriched_items'] = [item.dict() for item in enriched_items]
job.stages_completed.append('search_enrichment')
job.stage = 'ai_summarization'
await self.queue_manager.enqueue('ai_summarization', job)
await self.queue_manager.mark_completed('search_enrichment', job.job_id)
else:
logger.warning(f"No items enriched for job {job.job_id}")
await self.queue_manager.mark_failed(
'search_enrichment',
job,
"No items to enrich"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('search_enrichment', job, str(e))
async def _search_google(self, query: str) -> List[SearchResult]:
"""Google Custom Search API 호출"""
results = []
if not self.google_api_key or not self.search_engine_id:
logger.warning("Google API credentials not configured")
return results
try:
url = "https://www.googleapis.com/customsearch/v1"
params = {
"key": self.google_api_key,
"cx": self.search_engine_id,
"q": query,
"num": self.max_results_per_item,
"hl": "ko",
"gl": "kr"
}
async with aiohttp.ClientSession() as session:
async with session.get(url, params=params, timeout=30) as response:
if response.status == 200:
data = await response.json()
for item in data.get('items', []):
result = SearchResult(
title=item.get('title', ''),
link=item.get('link', ''),
snippet=item.get('snippet', ''),
source='google'
)
results.append(result)
else:
logger.error(f"Google API error: {response.status}")
except Exception as e:
logger.error(f"Error searching Google for '{query}': {e}")
return results
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Google Search Worker stopped")
async def main():
"""메인 함수"""
worker = GoogleSearchWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,3 @@
aiohttp==3.9.1
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,15 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./image-generator/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy application code
COPY ./image-generator /app
CMD ["python", "image_generator.py"]

View File

@ -0,0 +1,225 @@
"""
Image Generation Service
Replicate API를 사용한 이미지 생성 서비스
"""
import asyncio
import logging
import os
import sys
import base64
from typing import List, Dict, Any
import httpx
from io import BytesIO
# Import from shared module
from shared.models import PipelineJob, TranslatedItem, GeneratedImageItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ImageGeneratorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.replicate_api_key = os.getenv("REPLICATE_API_KEY")
self.replicate_api_url = "https://api.replicate.com/v1/predictions"
# Stable Diffusion 모델 사용
self.model_version = "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b"
async def start(self):
"""워커 시작"""
logger.info("Starting Image Generator Worker")
# Redis 연결
await self.queue_manager.connect()
# API 키 확인
if not self.replicate_api_key:
logger.warning("Replicate API key not configured - using placeholder images")
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기
job = await self.queue_manager.dequeue('image_generation', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""이미지 생성 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for image generation")
translated_items = job.data.get('translated_items', [])
generated_items = []
# 최대 3개 아이템만 이미지 생성 (API 비용 절감)
for idx, item_data in enumerate(translated_items[:3]):
translated_item = TranslatedItem(**item_data)
# 이미지 생성을 위한 프롬프트 생성
prompt = self._create_image_prompt(translated_item)
# 이미지 생성
image_url = await self._generate_image(prompt)
generated_item = GeneratedImageItem(
translated_item=translated_item,
image_url=image_url,
image_prompt=prompt
)
generated_items.append(generated_item)
# API 속도 제한
if self.replicate_api_key:
await asyncio.sleep(2)
if generated_items:
logger.info(f"Generated images for {len(generated_items)} items")
# 완료된 데이터를 job에 저장
job.data['generated_items'] = [item.dict() for item in generated_items]
job.stages_completed.append('image_generation')
job.stage = 'completed'
# 최종 기사 조립 단계로 전달 (이미 article-assembly로 수정)
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
else:
logger.warning(f"No images generated for job {job.job_id}")
# 이미지 생성 실패해도 다음 단계로 진행
job.stages_completed.append('image_generation')
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
# 이미지 생성 실패해도 다음 단계로 진행
job.stages_completed.append('image_generation')
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
def _create_image_prompt(self, translated_item: TranslatedItem) -> str:
"""이미지 생성을 위한 프롬프트 생성"""
# 영문 제목과 요약을 기반으로 프롬프트 생성
title = translated_item.translated_title or translated_item.summarized_item['enriched_item']['rss_item']['title']
summary = translated_item.translated_summary or translated_item.summarized_item['ai_summary']
# 뉴스 관련 이미지를 위한 프롬프트
prompt = f"News illustration for: {title[:100]}, professional, photorealistic, high quality, 4k"
return prompt
async def _generate_image(self, prompt: str) -> str:
"""Replicate API를 사용한 이미지 생성"""
try:
if not self.replicate_api_key:
# API 키가 없으면 플레이스홀더 이미지 URL 반환
return "https://via.placeholder.com/800x600.png?text=News+Image"
async with httpx.AsyncClient() as client:
# 예측 생성 요청
response = await client.post(
self.replicate_api_url,
headers={
"Authorization": f"Token {self.replicate_api_key}",
"Content-Type": "application/json"
},
json={
"version": self.model_version,
"input": {
"prompt": prompt,
"width": 768,
"height": 768,
"num_outputs": 1,
"scheduler": "K_EULER",
"num_inference_steps": 25,
"guidance_scale": 7.5,
"prompt_strength": 0.8,
"refine": "expert_ensemble_refiner",
"high_noise_frac": 0.8
}
},
timeout=60
)
if response.status_code in [200, 201]:
result = response.json()
prediction_id = result.get('id')
# 예측 결과 폴링
image_url = await self._poll_prediction(prediction_id)
return image_url
else:
logger.error(f"Replicate API error: {response.status_code}")
return "https://via.placeholder.com/800x600.png?text=Generation+Failed"
except Exception as e:
logger.error(f"Error generating image: {e}")
return "https://via.placeholder.com/800x600.png?text=Error"
async def _poll_prediction(self, prediction_id: str, max_attempts: int = 30) -> str:
"""예측 결과 폴링"""
try:
async with httpx.AsyncClient() as client:
for attempt in range(max_attempts):
response = await client.get(
f"{self.replicate_api_url}/{prediction_id}",
headers={
"Authorization": f"Token {self.replicate_api_key}"
},
timeout=30
)
if response.status_code == 200:
result = response.json()
status = result.get('status')
if status == 'succeeded':
output = result.get('output')
if output and isinstance(output, list) and len(output) > 0:
return output[0]
else:
return "https://via.placeholder.com/800x600.png?text=No+Output"
elif status == 'failed':
logger.error(f"Prediction failed: {result.get('error')}")
return "https://via.placeholder.com/800x600.png?text=Failed"
# 아직 처리중이면 대기
await asyncio.sleep(2)
else:
logger.error(f"Error polling prediction: {response.status_code}")
return "https://via.placeholder.com/800x600.png?text=Poll+Error"
# 최대 시도 횟수 초과
return "https://via.placeholder.com/800x600.png?text=Timeout"
except Exception as e:
logger.error(f"Error polling prediction: {e}")
return "https://via.placeholder.com/800x600.png?text=Poll+Exception"
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Image Generator Worker stopped")
async def main():
"""메인 함수"""
worker = ImageGeneratorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,3 @@
httpx==0.25.0
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,22 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./monitor/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy monitor code
COPY ./monitor /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Expose port
EXPOSE 8000
# Run
CMD ["uvicorn", "monitor:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

View File

@ -0,0 +1,349 @@
"""
Pipeline Monitor Service
파이프라인 상태 모니터링 및 대시보드 API
"""
import os
import sys
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Any
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from motor.motor_asyncio import AsyncIOMotorClient
import redis.asyncio as redis
# Import from shared module
from shared.models import KeywordSubscription, PipelineJob, FinalArticle
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="Pipeline Monitor", version="1.0.0")
# CORS 설정
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Global connections
redis_client = None
mongodb_client = None
db = None
@app.on_event("startup")
async def startup_event():
"""서버 시작 시 연결 초기화"""
global redis_client, mongodb_client, db
# Redis 연결
redis_url = os.getenv("REDIS_URL", "redis://redis:6379")
redis_client = await redis.from_url(redis_url, decode_responses=True)
# MongoDB 연결
mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
mongodb_client = AsyncIOMotorClient(mongodb_url)
db = mongodb_client[os.getenv("DB_NAME", "pipeline_db")]
logger.info("Pipeline Monitor started successfully")
@app.on_event("shutdown")
async def shutdown_event():
"""서버 종료 시 연결 해제"""
if redis_client:
await redis_client.close()
if mongodb_client:
mongodb_client.close()
@app.get("/")
async def root():
"""헬스 체크"""
return {"status": "Pipeline Monitor is running"}
@app.get("/api/stats")
async def get_stats():
"""전체 파이프라인 통계"""
try:
# 큐별 대기 작업 수
queue_stats = {}
queues = [
"queue:keyword",
"queue:rss",
"queue:search",
"queue:summarize",
"queue:assembly"
]
for queue in queues:
length = await redis_client.llen(queue)
queue_stats[queue] = length
# 오늘 생성된 기사 수
today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
articles_today = await db.articles.count_documents({
"created_at": {"$gte": today}
})
# 활성 키워드 수
active_keywords = await db.keywords.count_documents({
"is_active": True
})
# 총 기사 수
total_articles = await db.articles.count_documents({})
return {
"queues": queue_stats,
"articles_today": articles_today,
"active_keywords": active_keywords,
"total_articles": total_articles,
"timestamp": datetime.now().isoformat()
}
except Exception as e:
logger.error(f"Error getting stats: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/queues/{queue_name}")
async def get_queue_details(queue_name: str):
"""특정 큐의 상세 정보"""
try:
queue_key = f"queue:{queue_name}"
# 큐 길이
length = await redis_client.llen(queue_key)
# 최근 10개 작업 미리보기
items = await redis_client.lrange(queue_key, 0, 9)
# 처리 중인 작업
processing_key = f"processing:{queue_name}"
processing = await redis_client.smembers(processing_key)
# 실패한 작업
failed_key = f"failed:{queue_name}"
failed_count = await redis_client.llen(failed_key)
return {
"queue": queue_name,
"length": length,
"processing_count": len(processing),
"failed_count": failed_count,
"preview": items[:10],
"timestamp": datetime.now().isoformat()
}
except Exception as e:
logger.error(f"Error getting queue details: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/keywords")
async def get_keywords():
"""등록된 키워드 목록"""
try:
keywords = []
cursor = db.keywords.find({"is_active": True})
async for keyword in cursor:
# 해당 키워드의 최근 기사
latest_article = await db.articles.find_one(
{"keyword_id": str(keyword["_id"])},
sort=[("created_at", -1)]
)
keywords.append({
"id": str(keyword["_id"]),
"keyword": keyword["keyword"],
"schedule": keyword.get("schedule", "30분마다"),
"created_at": keyword.get("created_at"),
"last_article": latest_article["created_at"] if latest_article else None,
"article_count": await db.articles.count_documents(
{"keyword_id": str(keyword["_id"])}
)
})
return keywords
except Exception as e:
logger.error(f"Error getting keywords: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/api/keywords")
async def add_keyword(keyword: str, schedule: str = "30min"):
"""새 키워드 등록"""
try:
new_keyword = {
"keyword": keyword,
"schedule": schedule,
"is_active": True,
"created_at": datetime.now(),
"updated_at": datetime.now()
}
result = await db.keywords.insert_one(new_keyword)
return {
"id": str(result.inserted_id),
"keyword": keyword,
"message": "Keyword registered successfully"
}
except Exception as e:
logger.error(f"Error adding keyword: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.delete("/api/keywords/{keyword_id}")
async def delete_keyword(keyword_id: str):
"""키워드 비활성화"""
try:
result = await db.keywords.update_one(
{"_id": keyword_id},
{"$set": {"is_active": False, "updated_at": datetime.now()}}
)
if result.modified_count > 0:
return {"message": "Keyword deactivated successfully"}
else:
raise HTTPException(status_code=404, detail="Keyword not found")
except Exception as e:
logger.error(f"Error deleting keyword: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/articles")
async def get_articles(limit: int = 10, skip: int = 0):
"""최근 생성된 기사 목록"""
try:
articles = []
cursor = db.articles.find().sort("created_at", -1).skip(skip).limit(limit)
async for article in cursor:
articles.append({
"id": str(article["_id"]),
"title": article["title"],
"keyword": article["keyword"],
"summary": article.get("summary", ""),
"created_at": article["created_at"],
"processing_time": article.get("processing_time", 0),
"pipeline_stages": article.get("pipeline_stages", [])
})
total = await db.articles.count_documents({})
return {
"articles": articles,
"total": total,
"limit": limit,
"skip": skip
}
except Exception as e:
logger.error(f"Error getting articles: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/articles/{article_id}")
async def get_article(article_id: str):
"""특정 기사 상세 정보"""
try:
article = await db.articles.find_one({"_id": article_id})
if not article:
raise HTTPException(status_code=404, detail="Article not found")
return article
except Exception as e:
logger.error(f"Error getting article: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/workers")
async def get_workers():
"""워커 상태 정보"""
try:
workers = {}
worker_types = [
"scheduler",
"rss_collector",
"google_search",
"ai_summarizer",
"article_assembly"
]
for worker_type in worker_types:
active_key = f"workers:{worker_type}:active"
active_workers = await redis_client.smembers(active_key)
workers[worker_type] = {
"active": len(active_workers),
"worker_ids": list(active_workers)
}
return workers
except Exception as e:
logger.error(f"Error getting workers: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/api/trigger/{keyword}")
async def trigger_keyword_processing(keyword: str):
"""수동으로 키워드 처리 트리거"""
try:
# 키워드 찾기
keyword_doc = await db.keywords.find_one({
"keyword": keyword,
"is_active": True
})
if not keyword_doc:
raise HTTPException(status_code=404, detail="Keyword not found or inactive")
# 작업 생성
job = PipelineJob(
keyword_id=str(keyword_doc["_id"]),
keyword=keyword,
stage="keyword_processing",
created_at=datetime.now()
)
# 큐에 추가
await redis_client.rpush("queue:keyword", job.json())
return {
"message": f"Processing triggered for keyword: {keyword}",
"job_id": job.job_id
}
except Exception as e:
logger.error(f"Error triggering keyword: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/health")
async def health_check():
"""시스템 헬스 체크"""
try:
# Redis 체크
redis_status = await redis_client.ping()
# MongoDB 체크
mongodb_status = await db.command("ping")
return {
"status": "healthy",
"redis": "connected" if redis_status else "disconnected",
"mongodb": "connected" if mongodb_status else "disconnected",
"timestamp": datetime.now().isoformat()
}
except Exception as e:
return {
"status": "unhealthy",
"error": str(e),
"timestamp": datetime.now().isoformat()
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)

View File

@ -0,0 +1,6 @@
fastapi==0.104.1
uvicorn[standard]==0.24.0
redis[hiredis]==5.0.1
motor==3.1.1
pymongo==4.3.3
pydantic==2.5.0

View File

@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# 의존성 설치
COPY ./rss-collector/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 공통 모듈 복사
COPY ./shared /app/shared
# RSS Collector 코드 복사
COPY ./rss-collector /app
# 환경변수
ENV PYTHONUNBUFFERED=1
# 실행
CMD ["python", "rss_collector.py"]

View File

@ -0,0 +1,4 @@
feedparser==6.0.11
aiohttp==3.9.1
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,192 @@
"""
RSS Collector Service
RSS 피드 수집 및 중복 제거 서비스
"""
import asyncio
import logging
import os
import sys
import hashlib
from datetime import datetime
import feedparser
import aiohttp
import redis.asyncio as redis
from typing import List, Dict, Any
# Import from shared module
from shared.models import PipelineJob, RSSItem, EnrichedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RSSCollectorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.redis_client = None
self.redis_url = os.getenv("REDIS_URL", "redis://redis:6379")
self.dedup_ttl = 86400 * 7 # 7일간 중복 방지
self.max_items_per_feed = 10 # 피드당 최대 항목 수
async def start(self):
"""워커 시작"""
logger.info("Starting RSS Collector Worker")
# Redis 연결
await self.queue_manager.connect()
self.redis_client = await redis.from_url(
self.redis_url,
encoding="utf-8",
decode_responses=True
)
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기 (5초 대기)
job = await self.queue_manager.dequeue('rss_collection', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""RSS 수집 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for keyword '{job.keyword}'")
keyword = job.data.get('keyword', '')
rss_feeds = job.data.get('rss_feeds', [])
# 키워드가 포함된 RSS URL 생성
processed_feeds = self._prepare_feeds(rss_feeds, keyword)
all_items = []
for feed_url in processed_feeds:
try:
items = await self._fetch_rss_feed(feed_url, keyword)
all_items.extend(items)
except Exception as e:
logger.error(f"Error fetching feed {feed_url}: {e}")
if all_items:
# 중복 제거
unique_items = await self._deduplicate_items(all_items, keyword)
if unique_items:
logger.info(f"Collected {len(unique_items)} unique items for '{keyword}'")
# 다음 단계로 전달
job.data['rss_items'] = [item.dict() for item in unique_items]
job.stages_completed.append('rss_collection')
job.stage = 'search_enrichment'
await self.queue_manager.enqueue('search_enrichment', job)
await self.queue_manager.mark_completed('rss_collection', job.job_id)
else:
logger.info(f"No new items found for '{keyword}'")
await self.queue_manager.mark_completed('rss_collection', job.job_id)
else:
logger.warning(f"No RSS items collected for '{keyword}'")
await self.queue_manager.mark_failed(
'rss_collection',
job,
"No RSS items collected"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('rss_collection', job, str(e))
def _prepare_feeds(self, feeds: List[str], keyword: str) -> List[str]:
"""RSS 피드 URL 준비 (키워드 치환)"""
processed = []
for feed in feeds:
if '{keyword}' in feed:
processed.append(feed.replace('{keyword}', keyword))
else:
processed.append(feed)
return processed
async def _fetch_rss_feed(self, feed_url: str, keyword: str) -> List[RSSItem]:
"""RSS 피드 가져오기"""
items = []
try:
async with aiohttp.ClientSession() as session:
async with session.get(feed_url, timeout=30) as response:
content = await response.text()
# feedparser로 파싱
feed = feedparser.parse(content)
for entry in feed.entries[:self.max_items_per_feed]:
# 키워드 관련성 체크
title = entry.get('title', '')
summary = entry.get('summary', '')
# 제목이나 요약에 키워드가 포함된 경우만
if keyword.lower() in title.lower() or keyword.lower() in summary.lower():
item = RSSItem(
title=title,
link=entry.get('link', ''),
published=entry.get('published', ''),
summary=summary[:500] if summary else '',
source_feed=feed_url
)
items.append(item)
except Exception as e:
logger.error(f"Error fetching RSS feed {feed_url}: {e}")
return items
async def _deduplicate_items(self, items: List[RSSItem], keyword: str) -> List[RSSItem]:
"""중복 항목 제거"""
unique_items = []
dedup_key = f"dedup:{keyword}"
for item in items:
# 제목 해시 생성
item_hash = hashlib.md5(
f"{keyword}:{item.title}".encode()
).hexdigest()
# Redis Set으로 중복 확인
is_new = await self.redis_client.sadd(dedup_key, item_hash)
if is_new:
unique_items.append(item)
# TTL 설정
if unique_items:
await self.redis_client.expire(dedup_key, self.dedup_ttl)
return unique_items
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
if self.redis_client:
await self.redis_client.close()
logger.info("RSS Collector Worker stopped")
async def main():
"""메인 함수"""
worker = RSSCollectorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# 의존성 설치
COPY ./scheduler/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 공통 모듈 복사
COPY ./shared /app/shared
# 스케줄러 코드 복사
COPY ./scheduler /app
# 환경변수
ENV PYTHONUNBUFFERED=1
# 실행
CMD ["python", "scheduler.py"]

View File

@ -0,0 +1,5 @@
apscheduler==3.10.4
motor==3.1.1
pymongo==4.3.3
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,203 @@
"""
News Pipeline Scheduler
뉴스 파이프라인 스케줄러 서비스
"""
import asyncio
import logging
import os
import sys
from datetime import datetime, timedelta
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from motor.motor_asyncio import AsyncIOMotorClient
# Import from shared module
from shared.models import KeywordSubscription, PipelineJob
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class NewsScheduler:
def __init__(self):
self.scheduler = AsyncIOScheduler()
self.mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
self.db_name = os.getenv("DB_NAME", "pipeline_db")
self.db = None
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
async def start(self):
"""스케줄러 시작"""
logger.info("Starting News Pipeline Scheduler")
# MongoDB 연결
client = AsyncIOMotorClient(self.mongodb_url)
self.db = client[self.db_name]
# Redis 연결
await self.queue_manager.connect()
# 기본 스케줄 설정
# 매 30분마다 실행
self.scheduler.add_job(
self.process_keywords,
'interval',
minutes=30,
id='keyword_processor',
name='Process Active Keywords'
)
# 특정 시간대 강화 스케줄 (아침 7시, 점심 12시, 저녁 6시)
for hour in [7, 12, 18]:
self.scheduler.add_job(
self.process_priority_keywords,
'cron',
hour=hour,
minute=0,
id=f'priority_processor_{hour}',
name=f'Process Priority Keywords at {hour}:00'
)
# 매일 자정 통계 초기화
self.scheduler.add_job(
self.reset_daily_stats,
'cron',
hour=0,
minute=0,
id='stats_reset',
name='Reset Daily Statistics'
)
self.scheduler.start()
logger.info("Scheduler started successfully")
# 시작 즉시 한 번 실행
await self.process_keywords()
async def process_keywords(self):
"""활성 키워드 처리"""
try:
logger.info("Processing active keywords")
# MongoDB에서 활성 키워드 로드
now = datetime.now()
thirty_minutes_ago = now - timedelta(minutes=30)
keywords = await self.db.keywords.find({
"is_active": True,
"$or": [
{"last_processed": {"$lt": thirty_minutes_ago}},
{"last_processed": None}
]
}).to_list(None)
logger.info(f"Found {len(keywords)} keywords to process")
for keyword_doc in keywords:
await self._create_job(keyword_doc)
# 처리 시간 업데이트
await self.db.keywords.update_one(
{"keyword_id": keyword_doc['keyword_id']},
{"$set": {"last_processed": now}}
)
logger.info(f"Created jobs for {len(keywords)} keywords")
except Exception as e:
logger.error(f"Error processing keywords: {e}")
async def process_priority_keywords(self):
"""우선순위 키워드 처리"""
try:
logger.info("Processing priority keywords")
keywords = await self.db.keywords.find({
"is_active": True,
"is_priority": True
}).to_list(None)
for keyword_doc in keywords:
await self._create_job(keyword_doc, priority=1)
logger.info(f"Created priority jobs for {len(keywords)} keywords")
except Exception as e:
logger.error(f"Error processing priority keywords: {e}")
async def _create_job(self, keyword_doc: dict, priority: int = 0):
"""파이프라인 작업 생성"""
try:
# KeywordSubscription 모델로 변환
keyword = KeywordSubscription(**keyword_doc)
# PipelineJob 생성
job = PipelineJob(
keyword_id=keyword.keyword_id,
keyword=keyword.keyword,
stage='rss_collection',
stages_completed=[],
priority=priority,
data={
'keyword': keyword.keyword,
'language': keyword.language,
'rss_feeds': keyword.rss_feeds or self._get_default_rss_feeds(),
'categories': keyword.categories
}
)
# 첫 번째 큐에 추가
await self.queue_manager.enqueue(
'rss_collection',
job,
priority=priority
)
logger.info(f"Created job {job.job_id} for keyword '{keyword.keyword}'")
except Exception as e:
logger.error(f"Error creating job for keyword: {e}")
def _get_default_rss_feeds(self) -> list:
"""기본 RSS 피드 목록"""
return [
"https://news.google.com/rss/search?q={keyword}&hl=ko&gl=KR&ceid=KR:ko",
"https://trends.google.com/trends/trendingsearches/daily/rss?geo=KR",
"https://www.mk.co.kr/rss/40300001/", # 매일경제
"https://www.hankyung.com/feed/all-news", # 한국경제
"https://www.zdnet.co.kr/news/news_rss.xml", # ZDNet Korea
]
async def reset_daily_stats(self):
"""일일 통계 초기화"""
try:
logger.info("Resetting daily statistics")
# Redis 통계 초기화
# 구현 필요
pass
except Exception as e:
logger.error(f"Error resetting stats: {e}")
async def stop(self):
"""스케줄러 중지"""
self.scheduler.shutdown()
await self.queue_manager.disconnect()
logger.info("Scheduler stopped")
async def main():
"""메인 함수"""
scheduler = NewsScheduler()
try:
await scheduler.start()
# 계속 실행
while True:
await asyncio.sleep(60)
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await scheduler.stop()
if __name__ == "__main__":
asyncio.run(main())

View File

@ -0,0 +1 @@
# Shared modules for pipeline services

View File

@ -0,0 +1,113 @@
"""
Pipeline Data Models
파이프라인 전체에서 사용되는 공통 데이터 모델
"""
from datetime import datetime
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field
import uuid
class KeywordSubscription(BaseModel):
"""키워드 구독 모델"""
keyword_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
keyword: str
language: str = "ko"
schedule: str = "0 */30 * * *" # Cron expression (30분마다)
is_active: bool = True
is_priority: bool = False
last_processed: Optional[datetime] = None
rss_feeds: List[str] = Field(default_factory=list)
categories: List[str] = Field(default_factory=list)
created_at: datetime = Field(default_factory=datetime.now)
owner: Optional[str] = None
class PipelineJob(BaseModel):
"""파이프라인 작업 모델"""
job_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
keyword_id: str
keyword: str
stage: str # current stage
stages_completed: List[str] = Field(default_factory=list)
data: Dict[str, Any] = Field(default_factory=dict)
retry_count: int = 0
max_retries: int = 3
priority: int = 0
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
class RSSItem(BaseModel):
"""RSS 피드 아이템"""
item_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
title: str
link: str
published: Optional[str] = None
summary: Optional[str] = None
source_feed: str
class SearchResult(BaseModel):
"""검색 결과"""
title: str
link: str
snippet: Optional[str] = None
source: str = "google"
class EnrichedItem(BaseModel):
"""강화된 뉴스 아이템"""
rss_item: RSSItem
search_results: List[SearchResult] = Field(default_factory=list)
class SummarizedItem(BaseModel):
"""요약된 아이템"""
enriched_item: EnrichedItem
ai_summary: str
summary_language: str = "ko"
class TranslatedItem(BaseModel):
"""번역된 아이템"""
summarized_item: SummarizedItem
title_en: str
summary_en: str
class ItemWithImage(BaseModel):
"""이미지가 추가된 아이템"""
translated_item: TranslatedItem
image_url: str
image_prompt: str
class FinalArticle(BaseModel):
"""최종 기사"""
article_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
job_id: str
keyword_id: str
keyword: str
title: str
content: str
summary: str
source_items: List[ItemWithImage]
images: List[str]
categories: List[str] = Field(default_factory=list)
tags: List[str] = Field(default_factory=list)
created_at: datetime = Field(default_factory=datetime.now)
pipeline_stages: List[str]
processing_time: float # seconds
class TranslatedItem(BaseModel):
"""번역된 아이템"""
summarized_item: Dict[str, Any] # SummarizedItem as dict
translated_title: str
translated_summary: str
target_language: str = 'en'
class GeneratedImageItem(BaseModel):
"""이미지 생성된 아이템"""
translated_item: Dict[str, Any] # TranslatedItem as dict
image_url: str
image_prompt: str
class QueueMessage(BaseModel):
"""큐 메시지"""
message_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
queue_name: str
job: PipelineJob
timestamp: datetime = Field(default_factory=datetime.now)
retry_count: int = 0

View File

@ -0,0 +1,173 @@
"""
Queue Manager
Redis 기반 큐 관리 시스템
"""
import redis.asyncio as redis
import json
import logging
from typing import Optional, Dict, Any, List
from datetime import datetime
from .models import PipelineJob, QueueMessage
logger = logging.getLogger(__name__)
class QueueManager:
"""Redis 기반 큐 매니저"""
QUEUES = {
"keyword_processing": "queue:keyword",
"rss_collection": "queue:rss",
"search_enrichment": "queue:search",
"ai_summarization": "queue:summarize",
"translation": "queue:translate",
"image_generation": "queue:image",
"article_assembly": "queue:assembly",
"failed": "queue:failed",
"scheduled": "queue:scheduled"
}
def __init__(self, redis_url: str = "redis://redis:6379"):
self.redis_url = redis_url
self.redis_client: Optional[redis.Redis] = None
async def connect(self):
"""Redis 연결"""
if not self.redis_client:
self.redis_client = await redis.from_url(
self.redis_url,
encoding="utf-8",
decode_responses=True
)
logger.info("Connected to Redis")
async def disconnect(self):
"""Redis 연결 해제"""
if self.redis_client:
await self.redis_client.close()
self.redis_client = None
async def enqueue(self, queue_name: str, job: PipelineJob, priority: int = 0) -> str:
"""작업을 큐에 추가"""
try:
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
message = QueueMessage(
queue_name=queue_name,
job=job
)
# 우선순위에 따라 추가
if priority > 0:
await self.redis_client.lpush(queue_key, message.json())
else:
await self.redis_client.rpush(queue_key, message.json())
# 통계 업데이트
await self.redis_client.hincrby("stats:queues", queue_name, 1)
logger.info(f"Job {job.job_id} enqueued to {queue_name}")
return job.job_id
except Exception as e:
logger.error(f"Failed to enqueue job: {e}")
raise
async def dequeue(self, queue_name: str, timeout: int = 0) -> Optional[PipelineJob]:
"""큐에서 작업 가져오기"""
try:
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
if timeout > 0:
result = await self.redis_client.blpop(queue_key, timeout=timeout)
if result:
_, data = result
else:
return None
else:
data = await self.redis_client.lpop(queue_key)
if data:
message = QueueMessage.parse_raw(data)
# 처리 중 목록에 추가
processing_key = f"processing:{queue_name}"
await self.redis_client.hset(
processing_key,
message.job.job_id,
message.json()
)
return message.job
return None
except Exception as e:
logger.error(f"Failed to dequeue job: {e}")
return None
async def mark_completed(self, queue_name: str, job_id: str):
"""작업 완료 표시"""
try:
processing_key = f"processing:{queue_name}"
await self.redis_client.hdel(processing_key, job_id)
# 통계 업데이트
await self.redis_client.hincrby("stats:completed", queue_name, 1)
logger.info(f"Job {job_id} completed in {queue_name}")
except Exception as e:
logger.error(f"Failed to mark job as completed: {e}")
async def mark_failed(self, queue_name: str, job: PipelineJob, error: str):
"""작업 실패 처리"""
try:
processing_key = f"processing:{queue_name}"
await self.redis_client.hdel(processing_key, job.job_id)
# 재시도 확인
if job.retry_count < job.max_retries:
job.retry_count += 1
await self.enqueue(queue_name, job)
logger.info(f"Job {job.job_id} requeued (retry {job.retry_count}/{job.max_retries})")
else:
# 실패 큐로 이동
job.data["error"] = error
job.data["failed_stage"] = queue_name
await self.enqueue("failed", job)
# 통계 업데이트
await self.redis_client.hincrby("stats:failed", queue_name, 1)
logger.error(f"Job {job.job_id} failed: {error}")
except Exception as e:
logger.error(f"Failed to mark job as failed: {e}")
async def get_queue_stats(self) -> Dict[str, Any]:
"""큐 통계 조회"""
try:
stats = {}
for name, key in self.QUEUES.items():
stats[name] = {
"pending": await self.redis_client.llen(key),
"processing": await self.redis_client.hlen(f"processing:{name}"),
}
# 완료/실패 통계
stats["completed"] = await self.redis_client.hgetall("stats:completed") or {}
stats["failed"] = await self.redis_client.hgetall("stats:failed") or {}
return stats
except Exception as e:
logger.error(f"Failed to get queue stats: {e}")
return {}
async def clear_queue(self, queue_name: str):
"""큐 초기화 (테스트용)"""
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
await self.redis_client.delete(queue_key)
await self.redis_client.delete(f"processing:{queue_name}")
logger.info(f"Queue {queue_name} cleared")

View File

@ -0,0 +1,5 @@
redis[hiredis]==5.0.1
motor==3.1.1
pymongo==4.3.3
pydantic==2.5.0
python-dateutil==2.8.2

View File

@ -0,0 +1,15 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./translator/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy application code
COPY ./translator /app
CMD ["python", "translator.py"]

View File

@ -0,0 +1,3 @@
httpx==0.25.0
redis[hiredis]==5.0.1
pydantic==2.5.0

View File

@ -0,0 +1,154 @@
"""
Translation Service
DeepL API를 사용한 번역 서비스
"""
import asyncio
import logging
import os
import sys
from typing import List, Dict, Any
import httpx
# Import from shared module
from shared.models import PipelineJob, SummarizedItem, TranslatedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class TranslatorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.deepl_api_key = os.getenv("DEEPL_API_KEY", "3abbc796-2515-44a8-972d-22dcf27ab54a")
# DeepL Pro API 엔드포인트 사용
self.deepl_api_url = "https://api.deepl.com/v2/translate"
async def start(self):
"""워커 시작"""
logger.info("Starting Translator Worker")
# Redis 연결
await self.queue_manager.connect()
# DeepL API 키 확인
if not self.deepl_api_key:
logger.error("DeepL API key not configured")
return
# 메인 처리 루프
while True:
try:
# 큐에서 작업 가져오기
job = await self.queue_manager.dequeue('translation', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""번역 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for translation")
summarized_items = job.data.get('summarized_items', [])
translated_items = []
for item_data in summarized_items:
summarized_item = SummarizedItem(**item_data)
# 제목과 요약 번역
translated_title = await self._translate_text(
summarized_item.enriched_item['rss_item']['title'],
target_lang='EN'
)
translated_summary = await self._translate_text(
summarized_item.ai_summary,
target_lang='EN'
)
translated_item = TranslatedItem(
summarized_item=summarized_item,
translated_title=translated_title,
translated_summary=translated_summary,
target_language='en'
)
translated_items.append(translated_item)
# API 속도 제한
await asyncio.sleep(0.5)
if translated_items:
logger.info(f"Translated {len(translated_items)} items")
# 다음 단계로 전달
job.data['translated_items'] = [item.dict() for item in translated_items]
job.stages_completed.append('translation')
job.stage = 'image_generation'
await self.queue_manager.enqueue('image_generation', job)
await self.queue_manager.mark_completed('translation', job.job_id)
else:
logger.warning(f"No items translated for job {job.job_id}")
await self.queue_manager.mark_failed(
'translation',
job,
"No items to translate"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('translation', job, str(e))
async def _translate_text(self, text: str, target_lang: str = 'EN') -> str:
"""DeepL API를 사용한 텍스트 번역"""
try:
if not text:
return ""
async with httpx.AsyncClient() as client:
response = await client.post(
self.deepl_api_url,
data={
'auth_key': self.deepl_api_key,
'text': text,
'target_lang': target_lang,
'source_lang': 'KO'
},
timeout=30
)
if response.status_code == 200:
result = response.json()
return result['translations'][0]['text']
else:
logger.error(f"DeepL API error: {response.status_code}")
return text # 번역 실패시 원본 반환
except Exception as e:
logger.error(f"Error translating text: {e}")
return text # 번역 실패시 원본 반환
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Translator Worker stopped")
async def main():
"""메인 함수"""
worker = TranslatorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())