feat: Implement async queue-based news pipeline with microservices

Major architectural transformation from synchronous to asynchronous processing:

## Pipeline Services (8 microservices)
- pipeline-scheduler: APScheduler for 30-minute periodic job triggers
- pipeline-rss-collector: RSS feed collection with deduplication (7-day TTL)
- pipeline-google-search: Content enrichment via Google Search API
- pipeline-ai-summarizer: AI summarization using Claude API (claude-sonnet-4-20250514)
- pipeline-translator: Translation using DeepL Pro API
- pipeline-image-generator: Image generation with Replicate API (Stable Diffusion)
- pipeline-article-assembly: Final article assembly and MongoDB storage
- pipeline-monitor: Real-time monitoring dashboard (port 8100)

## Key Features
- Redis-based job queue with deduplication
- Asynchronous processing with Python asyncio
- Shared models and queue manager for inter-service communication
- Docker containerization for all services
- Container names standardized with site11_ prefix

## Removed Services
- Moved to backup: google-search, rss-feed, news-aggregator, ai-writer

## Configuration
- DeepL Pro API: 3abbc796-2515-44a8-972d-22dcf27ab54a
- Claude Model: claude-sonnet-4-20250514
- Redis Queue TTL: 7 days for deduplication

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: jungwoo choi
Date: 2025-09-13 19:22:14 +09:00
Parent: 1d90af7c3c
Commit: 070032006e
73 changed files with 5922 additions and 4 deletions


@@ -1,153 +0,0 @@
# Google Search Service
A service that retrieves Google search results for a keyword.
## Key Features
### 1. Multiple Search Methods
- **Google Custom Search API**: the official Google API (recommended)
- **SerpAPI**: an alternative search API
- **Web scraping**: fallback option (limited)
### 2. Search Options
- Up to 20 search results
- Per-language and per-country search
- Date-based filtering and sorting
- Full-content fetching
## API Endpoints
### Basic Search
```
GET /api/search?q=keyword&num=20&lang=ko&country=kr
```
**Parameters:**
- `q`: search keyword (required)
- `num`: number of results (1-20, default: 10)
- `lang`: language code (ko, en, etc.)
- `country`: country code (kr, us, etc.)
- `date_restrict`: date restriction
- `d7`: within one week
- `m1`: within one month
- `m3`: within three months
- `y1`: within one year
- `sort_by_date`: sort by most recent (true/false)
### Full-Content Search
```
GET /api/search/full?q=keyword&num=5
```
Fetches the full content of each result page (this can take a while).
### Real-Time Trending
```
GET /api/trending?country=kr
```
## Usage Examples
### 1. Korean search (newest first)
```bash
curl "http://localhost:8016/api/search?q=인공지능&num=20&lang=ko&country=kr&sort_by_date=true"
```
### 2. English search (US)
```bash
curl "http://localhost:8016/api/search?q=artificial%20intelligence&num=10&lang=en&country=us"
```
### 3. Results from the past week only
```bash
curl "http://localhost:8016/api/search?q=뉴스&date_restrict=d7&lang=ko"
```
### 4. Fetch full content
```bash
curl "http://localhost:8016/api/search/full?q=python%20tutorial&num=3"
```
## Environment Configuration
### Required API Keys
1. **Google Custom Search API**
- Issue an API key in the [Google Cloud Console](https://console.cloud.google.com/apis/credentials)
- Create a search engine ID at [Programmable Search Engine](https://programmablesearchengine.google.com/)
2. **SerpAPI (optional)**
- Issue an API key at [SerpAPI](https://serpapi.com/)
### .env File
```env
# Google Custom Search API
GOOGLE_API_KEY=your_api_key_here
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id_here
# SerpAPI (optional)
SERPAPI_KEY=your_serpapi_key_here
# Redis cache
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DB=2
# Default settings
DEFAULT_LANGUAGE=ko
DEFAULT_COUNTRY=kr
CACHE_TTL=3600
```
## Running with Docker
```bash
# Build and run
docker-compose build google-search-backend
docker-compose up -d google-search-backend
# Check the logs
docker-compose logs -f google-search-backend
```
## Limitations
### Google Custom Search API
- Free tier: limited to 100 queries per day
- At most 100 results per search
- Snippet length is capped server-side (cannot be changed)
### Workarounds
- Need more than 20 results: use pagination (see the sketch below)
- Need longer content: use the `/api/search/full` endpoint
- API limit reached: automatic fallback to SerpAPI or web scraping
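As a minimal pagination sketch (the same pattern the service uses internally against the Google Custom Search API; `api_key` and `cx` stand in for your own credentials):

```python
import httpx

async def paged_google_search(query: str, api_key: str, cx: str, total: int = 20) -> list:
    """Fetch up to `total` results by paging with the 1-based `start` parameter.

    The API returns at most 10 items per call, so 20 results take two calls.
    """
    items = []
    async with httpx.AsyncClient() as client:
        for start in range(1, total + 1, 10):
            resp = await client.get(
                "https://www.googleapis.com/customsearch/v1",
                params={"key": api_key, "cx": cx, "q": query,
                        "num": min(10, total - len(items)), "start": start},
            )
            resp.raise_for_status()
            items.extend(resp.json().get("items", []))
    return items
```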
## Cache Management
Search results are cached in Redis:
- Default TTL: 3600 seconds (1 hour)
- Clear the cache: `POST /api/clear-cache`
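A minimal sketch of the caching pattern (mirroring the `google_search:` + MD5 key scheme used by the service; `run_search` is a hypothetical stand-in for the actual search call):

```python
import hashlib
import json

import redis

r = redis.Redis(host="redis", port=6379, db=2, decode_responses=True)

def cached_search(query: str, ttl: int = 3600) -> dict:
    """Return a cached result if present; otherwise search and cache for `ttl` seconds."""
    key = "google_search:" + hashlib.md5(query.encode()).hexdigest()
    hit = r.get(key)
    if hit:
        return json.loads(hit)
    results = run_search(query)  # hypothetical: the real search call goes here
    r.setex(key, ttl, json.dumps(results))
    return results
```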
## Health Check
```bash
curl http://localhost:8016/health
```
## Troubleshooting
### 1. Korean searches fail
Use URL encoding:
```bash
# "인공지능" → %EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5
curl "http://localhost:8016/api/search?q=%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5"
```
### 2. API limit errors
- Check the Google API daily quota
- Configure a SerpAPI key as an alternative
- Rely on the automatic web-scraping fallback
### 3. Slow responses
- Verify the Redis cache is enabled
- Reduce the number of results
- Use basic search instead of full-content search


@@ -1,21 +0,0 @@
# Google Custom Search API Configuration
# Get your API key from: https://console.cloud.google.com/apis/credentials
GOOGLE_API_KEY=
# Get your Search Engine ID from: https://programmablesearchengine.google.com/
GOOGLE_SEARCH_ENGINE_ID=
# Alternative: SerpAPI Configuration
# Get your API key from: https://serpapi.com/
SERPAPI_KEY=
# Redis Configuration
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_DB=2
# Search Settings
DEFAULT_LANGUAGE=ko
DEFAULT_COUNTRY=kr
CACHE_TTL=3600
MAX_RESULTS=10


@@ -1,10 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]


@@ -1,30 +0,0 @@
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# Google Custom Search API settings
google_api_key: Optional[str] = None
google_search_engine_id: Optional[str] = None
# SerpAPI settings (alternative)
serpapi_key: Optional[str] = None
# Redis cache settings
redis_host: str = "redis"
redis_port: int = 6379
redis_db: int = 2
cache_ttl: int = 3600  # 1 hour
# Search settings
max_results: int = 10
default_language: str = "ko"
default_country: str = "kr"
# Service settings
service_name: str = "Google Search Service"
debug: bool = True
class Config:
env_file = ".env"
settings = Settings()


@@ -1,188 +0,0 @@
from fastapi import FastAPI, Query, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from typing import Optional
from datetime import datetime
from contextlib import asynccontextmanager
from .search_service import GoogleSearchService
from .config import settings
@asynccontextmanager
async def lifespan(app: FastAPI):
# On startup
print("Google Search Service starting...")
yield
# On shutdown
print("Google Search Service stopping...")
app = FastAPI(
title="Google Search Service",
description="구글 검색 결과를 수신하는 서비스",
version="1.0.0",
lifespan=lifespan
)
# CORS configuration
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Initialize the search service
search_service = GoogleSearchService()
@app.get("/")
async def root():
return {
"service": "Google Search Service",
"version": "1.0.0",
"timestamp": datetime.now().isoformat(),
"endpoints": {
"search": "/api/search?q=keyword",
"custom_search": "/api/search/custom?q=keyword",
"serpapi_search": "/api/search/serpapi?q=keyword",
"scraping_search": "/api/search/scraping?q=keyword",
"trending": "/api/trending",
"health": "/health"
}
}
@app.get("/health")
async def health_check():
return {
"status": "healthy",
"service": "google-search",
"timestamp": datetime.now().isoformat()
}
@app.get("/api/search")
async def search(
q: str = Query(..., description="검색 키워드"),
num: int = Query(10, description="결과 개수", ge=1, le=20),
lang: Optional[str] = Query(None, description="언어 코드 (ko, en 등)"),
country: Optional[str] = Query(None, description="국가 코드 (kr, us 등)"),
date_restrict: Optional[str] = Query(None, description="날짜 제한 (d7=일주일, m1=한달, m3=3개월, y1=1년)"),
sort_by_date: bool = Query(False, description="최신순 정렬")
):
"""
Search Google using the best available method, selected automatically:
1. Google Custom Search API (if configured)
2. SerpAPI (if configured)
3. Web scraping (fallback)
"""
# Try the Google Custom Search API
if settings.google_api_key and settings.google_search_engine_id:
result = await search_service.search_with_custom_api(q, num, lang, country, date_restrict, sort_by_date)
if "error" not in result or not result["error"]:
result["method"] = "google_custom_search"
return result
# Try SerpAPI
if settings.serpapi_key:
result = await search_service.search_with_serpapi(q, num, lang, country)
if "error" not in result or not result["error"]:
result["method"] = "serpapi"
return result
# Fall back to web scraping
result = await search_service.search_with_scraping(q, num, lang)
result["method"] = "web_scraping"
result["warning"] = "API 키가 설정되지 않아 웹 스크래핑을 사용합니다. 제한적이고 불안정할 수 있습니다."
return result
@app.get("/api/search/custom")
async def search_custom(
q: str = Query(..., description="검색 키워드"),
num: int = Query(10, description="결과 개수", ge=1, le=10),
lang: Optional[str] = Query(None, description="언어 코드"),
country: Optional[str] = Query(None, description="국가 코드")
):
"""Google Custom Search API를 사용한 검색"""
if not settings.google_api_key or not settings.google_search_engine_id:
raise HTTPException(
status_code=503,
detail="Google Custom Search API credentials not configured"
)
result = await search_service.search_with_custom_api(q, num, lang, country)
if "error" in result and result["error"]:
raise HTTPException(status_code=500, detail=result["error"])
return result
@app.get("/api/search/serpapi")
async def search_serpapi(
q: str = Query(..., description="검색 키워드"),
num: int = Query(10, description="결과 개수", ge=1, le=50),
lang: Optional[str] = Query(None, description="언어 코드"),
country: Optional[str] = Query(None, description="국가 코드")
):
"""SerpAPI를 사용한 검색"""
if not settings.serpapi_key:
raise HTTPException(
status_code=503,
detail="SerpAPI key not configured"
)
result = await search_service.search_with_serpapi(q, num, lang, country)
if "error" in result and result["error"]:
raise HTTPException(status_code=500, detail=result["error"])
return result
@app.get("/api/search/scraping")
async def search_scraping(
q: str = Query(..., description="검색 키워드"),
num: int = Query(10, description="결과 개수", ge=1, le=20),
lang: Optional[str] = Query(None, description="언어 코드")
):
"""웹 스크래핑을 사용한 검색 (제한적)"""
result = await search_service.search_with_scraping(q, num, lang)
if "error" in result and result["error"]:
raise HTTPException(status_code=500, detail=result["error"])
result["warning"] = "웹 스크래핑은 제한적이고 불안정할 수 있습니다"
return result
@app.get("/api/search/full")
async def search_with_full_content(
q: str = Query(..., description="검색 키워드"),
num: int = Query(5, description="결과 개수", ge=1, le=10),
lang: Optional[str] = Query(None, description="언어 코드 (ko, en 등)"),
country: Optional[str] = Query(None, description="국가 코드 (kr, us 등)")
):
"""
Google 검색 후 각 결과 페이지의 전체 내용을 가져오기
주의: 시간이 오래 걸릴 수 있음
"""
result = await search_service.search_with_full_content(q, num, lang, country)
if "error" in result and result["error"]:
raise HTTPException(status_code=500, detail=result["error"])
return result
@app.get("/api/trending")
async def get_trending(
country: Optional[str] = Query(None, description="국가 코드 (kr, us 등)")
):
"""실시간 트렌딩 검색어 조회"""
result = await search_service.get_trending_searches(country)
if "error" in result and result["error"]:
raise HTTPException(status_code=500, detail=result["error"])
return result
@app.post("/api/clear-cache")
async def clear_cache():
"""캐시 초기화"""
try:
search_service.redis_client.flushdb()
return {
"status": "success",
"message": "캐시가 초기화되었습니다"
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))


@@ -1,540 +0,0 @@
import httpx
import json
import redis
from typing import List, Dict, Optional
from datetime import datetime
import hashlib
from bs4 import BeautifulSoup
from .config import settings
class GoogleSearchService:
def __init__(self):
# Redis connection
self.redis_client = redis.Redis(
host=settings.redis_host,
port=settings.redis_port,
db=settings.redis_db,
decode_responses=True
)
def _get_cache_key(self, query: str, **kwargs) -> str:
"""캐시 키 생성"""
cache_data = f"{query}_{kwargs}"
return f"google_search:{hashlib.md5(cache_data.encode()).hexdigest()}"
async def search_with_custom_api(
self,
query: str,
num_results: int = 10,
language: str = None,
country: str = None,
date_restrict: str = None,
sort_by_date: bool = False
) -> Dict:
"""Google Custom Search API 사용"""
if not settings.google_api_key or not settings.google_search_engine_id:
return {
"error": "Google API credentials not configured",
"results": []
}
# 캐시 확인
cache_key = self._get_cache_key(query, num=num_results, lang=language, country=country)
cached = self.redis_client.get(cache_key)
if cached:
return json.loads(cached)
url = "https://www.googleapis.com/customsearch/v1"
all_results = []
total_results_info = None
# Google API는 한 번에 최대 10개만 반환, 20개를 원하면 2번 요청
num_requests = min((num_results + 9) // 10, 2) # 최대 2번 요청 (20개까지)
async with httpx.AsyncClient() as client:
for page in range(num_requests):
start_index = page * 10 + 1
current_num = min(10, num_results - page * 10)
params = {
"key": settings.google_api_key,
"cx": settings.google_search_engine_id,
"q": query,
"num": current_num,
"start": start_index, # 시작 인덱스
"hl": language or settings.default_language,
"gl": country or settings.default_country
}
# Add a date restriction (d7 = one week, m1 = one month, y1 = one year)
if date_restrict:
params["dateRestrict"] = date_restrict
# Sort by date (sort=date in the Google Custom Search API)
if sort_by_date:
params["sort"] = "date"
try:
response = await client.get(url, params=params)
response.raise_for_status()
data = response.json()
# Store the overall info from the first request only
if page == 0:
total_results_info = {
"total_results": data.get("searchInformation", {}).get("totalResults"),
"search_time": data.get("searchInformation", {}).get("searchTime"),
"query": data.get("queries", {}).get("request", [{}])[0].get("searchTerms")
}
# Append the results
for item in data.get("items", []):
all_results.append({
"title": item.get("title"),
"link": item.get("link"),
"snippet": item.get("snippet"),
"display_link": item.get("displayLink"),
"thumbnail": item.get("pagemap", {}).get("cse_thumbnail", [{}])[0].get("src") if "pagemap" in item else None
})
except Exception as e:
# If the first request fails, return an error
if page == 0:
return {
"error": str(e),
"results": []
}
# If the second request fails, return the first batch only
break
results = {
"query": total_results_info.get("query") if total_results_info else query,
"total_results": total_results_info.get("total_results") if total_results_info else "0",
"search_time": total_results_info.get("search_time") if total_results_info else 0,
"results": all_results[:num_results], # 요청한 개수만큼만 반환
"timestamp": datetime.utcnow().isoformat()
}
# Save to the cache
self.redis_client.setex(
cache_key,
settings.cache_ttl,
json.dumps(results)
)
return results
async def search_with_serpapi(
self,
query: str,
num_results: int = 10,
language: str = None,
country: str = None
) -> Dict:
"""SerpAPI 사용 (유료 서비스)"""
if not settings.serpapi_key:
return {
"error": "SerpAPI key not configured",
"results": []
}
# Check the cache
cache_key = self._get_cache_key(query, num=num_results, lang=language, country=country)
cached = self.redis_client.get(cache_key)
if cached:
return json.loads(cached)
from serpapi import GoogleSearch
params = {
"q": query,
"api_key": settings.serpapi_key,
"num": num_results,
"hl": language or settings.default_language,
"gl": country or settings.default_country
}
try:
search = GoogleSearch(params)
results = search.get_dict()
formatted_results = self._format_serpapi_results(results)
# Save to the cache
self.redis_client.setex(
cache_key,
settings.cache_ttl,
json.dumps(formatted_results)
)
return formatted_results
except Exception as e:
return {
"error": str(e),
"results": []
}
async def search_with_scraping(
self,
query: str,
num_results: int = 10,
language: str = None
) -> Dict:
"""웹 스크래핑으로 검색 (비추천, 제한적)"""
# 캐시 확인
cache_key = self._get_cache_key(query, num=num_results, lang=language)
cached = self.redis_client.get(cache_key)
if cached:
return json.loads(cached)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
params = {
"q": query,
"num": num_results,
"hl": language or settings.default_language
}
async with httpx.AsyncClient() as client:
try:
response = await client.get(
"https://www.google.com/search",
params=params,
headers=headers,
follow_redirects=True
)
soup = BeautifulSoup(response.text, 'html.parser')
results = self._parse_google_html(soup)
formatted_results = {
"query": query,
"total_results": len(results),
"results": results,
"timestamp": datetime.utcnow().isoformat()
}
# Save to the cache
self.redis_client.setex(
cache_key,
settings.cache_ttl,
json.dumps(formatted_results)
)
return formatted_results
except Exception as e:
return {
"error": str(e),
"results": []
}
def _format_google_results(self, data: Dict) -> Dict:
"""Google API 결과 포맷팅"""
results = []
for item in data.get("items", []):
results.append({
"title": item.get("title"),
"link": item.get("link"),
"snippet": item.get("snippet"),
"display_link": item.get("displayLink"),
"thumbnail": item.get("pagemap", {}).get("cse_thumbnail", [{}])[0].get("src") if "pagemap" in item else None
})
return {
"query": data.get("queries", {}).get("request", [{}])[0].get("searchTerms"),
"total_results": data.get("searchInformation", {}).get("totalResults"),
"search_time": data.get("searchInformation", {}).get("searchTime"),
"results": results,
"timestamp": datetime.utcnow().isoformat()
}
def _format_serpapi_results(self, data: Dict) -> Dict:
"""SerpAPI 결과 포맷팅"""
results = []
for item in data.get("organic_results", []):
results.append({
"title": item.get("title"),
"link": item.get("link"),
"snippet": item.get("snippet"),
"position": item.get("position"),
"thumbnail": item.get("thumbnail"),
"date": item.get("date")
})
# Related searches
related_searches = [
item.get("query") for item in data.get("related_searches", [])
]
return {
"query": data.get("search_parameters", {}).get("q"),
"total_results": data.get("search_information", {}).get("total_results"),
"search_time": data.get("search_information", {}).get("time_taken_displayed"),
"results": results,
"related_searches": related_searches,
"timestamp": datetime.utcnow().isoformat()
}
def _parse_google_html(self, soup: BeautifulSoup) -> List[Dict]:
"""HTML 파싱으로 검색 결과 추출"""
results = []
# 검색 결과 컨테이너 찾기
for g in soup.find_all('div', class_='g'):
anchors = g.find_all('a')
if anchors:
link = anchors[0].get('href', '')
title_elem = g.find('h3')
snippet_elem = g.find('span', class_='st') or g.find('div', class_='s')
if title_elem and link:
results.append({
"title": title_elem.get_text(),
"link": link,
"snippet": snippet_elem.get_text() if snippet_elem else ""
})
return results
async def fetch_page_content(self, url: str) -> Dict:
"""웹 페이지의 전체 내용을 가져오기"""
try:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
async with httpx.AsyncClient(timeout=10.0) as client:
response = await client.get(url, headers=headers, follow_redirects=True)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Remove unneeded tags
for script in soup(["script", "style", "nav", "header", "footer"]):
script.decompose()
# Try to extract the main content
main_content = None
# 1. Look for an article tag
article = soup.find('article')
if article:
main_content = article.get_text()
# 2. Look for a main tag
if not main_content:
main = soup.find('main')
if main:
main_content = main.get_text()
# 3. Look for typical content divs
if not main_content:
content_divs = soup.find_all('div', class_=lambda x: x and ('content' in x.lower() or 'article' in x.lower() or 'post' in x.lower()))
if content_divs:
main_content = ' '.join([div.get_text() for div in content_divs[:3]])
# 4. Extract text from the whole body
if not main_content:
body = soup.find('body')
if body:
main_content = body.get_text()
else:
main_content = soup.get_text()
# Clean up the text
main_content = ' '.join(main_content.split())
# Extract the title
title = soup.find('title')
title_text = title.get_text() if title else ""
# Extract the meta description
meta_desc = soup.find('meta', attrs={'name': 'description'})
description = meta_desc.get('content', '') if meta_desc else ""
return {
"url": url,
"title": title_text,
"description": description,
"content": main_content[:5000], # 최대 5000자
"content_length": len(main_content),
"success": True
}
except Exception as e:
return {
"url": url,
"error": str(e),
"success": False
}
async def search_with_extended_snippet(
self,
query: str,
num_results: int = 10,
language: str = None,
country: str = None
) -> Dict:
"""검색 후 확장된 snippet 가져오기 (메타 설명 + 첫 500자)"""
# 먼저 일반 검색 수행
search_results = await self.search_with_custom_api(
query, num_results, language, country
)
if "error" in search_results:
return search_results
# Fetch an extended snippet for each result
import asyncio
async def fetch_extended_snippet(result):
"""Fetch an extended snippet for a single page"""
enhanced_result = result.copy()
if result.get("link"):
try:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
async with httpx.AsyncClient(timeout=5.0) as client:
response = await client.get(result["link"], headers=headers, follow_redirects=True)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the meta description
meta_desc = soup.find('meta', attrs={'name': 'description'})
if not meta_desc:
meta_desc = soup.find('meta', attrs={'property': 'og:description'})
description = meta_desc.get('content', '') if meta_desc else ""
# Extract the start of the body text
for script in soup(["script", "style"]):
script.decompose()
# Find the body text
text_content = ""
for tag in ['article', 'main', 'div']:
elements = soup.find_all(tag)
for elem in elements:
text = elem.get_text().strip()
if len(text) > 200:  # meaningful text only
text_content = ' '.join(text.split())[:1000]
break
if text_content:
break
# Merge with the existing snippet
extended_snippet = result.get("snippet", "")
if description and description not in extended_snippet:
extended_snippet = description + " ... " + extended_snippet
if text_content and len(extended_snippet) < 500:
extended_snippet = extended_snippet + " ... " + text_content[:500-len(extended_snippet)]
enhanced_result["snippet"] = extended_snippet[:1000] # 최대 1000자
enhanced_result["extended"] = True
except Exception as e:
# On failure, keep the original snippet
enhanced_result["extended"] = False
enhanced_result["fetch_error"] = str(e)
return enhanced_result
# Process all pages in parallel
tasks = [fetch_extended_snippet(result) for result in search_results.get("results", [])]
enhanced_results = await asyncio.gather(*tasks)
return {
**search_results,
"results": enhanced_results,
"snippet_extended": True
}
async def search_with_full_content(
self,
query: str,
num_results: int = 5,
language: str = None,
country: str = None
) -> Dict:
"""검색 후 각 결과의 전체 내용 가져오기"""
# 먼저 일반 검색 수행
search_results = await self.search_with_custom_api(
query, num_results, language, country
)
if "error" in search_results:
return search_results
# Fetch the full content of each result
enhanced_results = []
for result in search_results.get("results", [])[:num_results]:
# Copy the original search result
enhanced_result = result.copy()
# Fetch the page content
if result.get("link"):
content_data = await self.fetch_page_content(result["link"])
enhanced_result["full_content"] = content_data
enhanced_results.append(enhanced_result)
return {
**search_results,
"results": enhanced_results,
"content_fetched": True
}
async def get_trending_searches(self, country: str = None) -> Dict:
"""트렌딩 검색어 가져오기"""
# Google Trends 비공식 API 사용
url = f"https://trends.google.com/trends/api/dailytrends"
params = {
"geo": country or settings.default_country.upper()
}
async with httpx.AsyncClient() as client:
try:
response = await client.get(url, params=params)
# The Google Trends API prefixes its response with ")]}',\n"
json_data = response.text[6:]
data = json.loads(json_data)
trending = []
for date_data in data.get("default", {}).get("trendingSearchesDays", []):
for search in date_data.get("trendingSearches", []):
trending.append({
"title": search.get("title", {}).get("query"),
"traffic": search.get("formattedTraffic"),
"articles": [
{
"title": article.get("title"),
"url": article.get("url"),
"source": article.get("source")
}
for article in search.get("articles", [])[:3]
]
})
return {
"country": country or settings.default_country,
"trending": trending[:10],
"timestamp": datetime.utcnow().isoformat()
}
except Exception as e:
return {
"error": str(e),
"trending": []
}


@@ -1,9 +0,0 @@
fastapi==0.109.0
uvicorn[standard]==0.27.0
httpx==0.26.0
pydantic==2.5.3
pydantic-settings==2.1.0
google-api-python-client==2.108.0
beautifulsoup4==4.12.2
redis==5.0.1
serpapi==0.1.5


@@ -0,0 +1,90 @@
# Pipeline Makefile
.PHONY: help build up down restart logs clean test monitor
help:
@echo "Pipeline Management Commands:"
@echo " make build - Build all Docker images"
@echo " make up - Start all services"
@echo " make down - Stop all services"
@echo " make restart - Restart all services"
@echo " make logs - View logs for all services"
@echo " make clean - Clean up containers and volumes"
@echo " make monitor - Open monitor dashboard"
@echo " make test - Test pipeline with sample keyword"
build:
docker-compose build
up:
docker-compose up -d
down:
docker-compose down
restart:
docker-compose restart
logs:
docker-compose logs -f
clean:
docker-compose down -v
docker system prune -f
monitor:
@echo "Opening monitor dashboard..."
@echo "Dashboard: http://localhost:8100"
@echo "API Docs: http://localhost:8100/docs"
test:
@echo "Testing pipeline with sample keyword..."
curl -X POST http://localhost:8100/api/keywords \
-H "Content-Type: application/json" \
-d '{"keyword": "테스트", "schedule": "30min"}'
@echo "\nTriggering immediate processing..."
curl -X POST http://localhost:8100/api/trigger/테스트
# Service-specific commands
scheduler-logs:
docker-compose logs -f scheduler
rss-logs:
docker-compose logs -f rss-collector
search-logs:
docker-compose logs -f google-search
summarizer-logs:
docker-compose logs -f ai-summarizer
assembly-logs:
docker-compose logs -f article-assembly
monitor-logs:
docker-compose logs -f monitor
# Database commands
redis-cli:
docker-compose exec redis redis-cli
mongo-shell:
docker-compose exec mongodb mongosh -u admin -p password123
# Queue management
queue-status:
@echo "Checking queue status..."
docker-compose exec redis redis-cli --raw LLEN queue:keyword
docker-compose exec redis redis-cli --raw LLEN queue:rss
docker-compose exec redis redis-cli --raw LLEN queue:search
docker-compose exec redis redis-cli --raw LLEN queue:summarize
docker-compose exec redis redis-cli --raw LLEN queue:assembly
queue-clear:
@echo "Clearing all queues..."
docker-compose exec redis redis-cli FLUSHDB
# Health check
health:
@echo "Checking service health..."
curl -s http://localhost:8100/api/health | python3 -m json.tool

services/pipeline/README.md

@@ -0,0 +1,154 @@
# News Pipeline System
An asynchronous, queue-based news generation pipeline.
## Architecture
```
Scheduler → RSS Collector → Google Search → AI Summarizer → Article Assembly → MongoDB
     ↓             ↓              ↓               ↓                 ↓
Redis Queue   Redis Queue    Redis Queue    Redis Queue      Redis Queue
```
## Services
### 1. Scheduler
- Processes registered keywords every 30 minutes (see the sketch below)
- Priority runs at 7 AM, noon, and 6 PM
- Loads keywords from MongoDB and creates jobs on the queue
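A minimal sketch of how those triggers could be wired with APScheduler (the scheduler service in this commit uses APScheduler); `enqueue_all_keywords` is a hypothetical coroutine standing in for the real MongoDB-to-queue step:

```python
import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler

async def enqueue_all_keywords():
    """Hypothetical: load active keywords from MongoDB and enqueue one job each."""

async def main():
    scheduler = AsyncIOScheduler()
    # Every 30 minutes, plus priority runs at 07:00, 12:00, and 18:00
    scheduler.add_job(enqueue_all_keywords, "interval", minutes=30)
    scheduler.add_job(enqueue_all_keywords, "cron", hour="7,12,18", minute=0)
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive

if __name__ == "__main__":
    asyncio.run(main())
```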
### 2. RSS Collector
- Collects RSS feeds (Google News RSS)
- 7-day deduplication via a Redis set (see the sketch below)
- Filters items for keyword relevance
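A sketch of the 7-day dedup check, assuming the `dedup:rss:*` Redis set convention listed under Redis Keys below:

```python
import hashlib

import redis.asyncio as redis

SEVEN_DAYS = 7 * 24 * 3600

async def is_new_item(r: redis.Redis, keyword: str, link: str) -> bool:
    """Return True the first time a link is seen for a keyword within the window."""
    key = f"dedup:rss:{keyword}"
    member = hashlib.md5(link.encode()).hexdigest()
    added = await r.sadd(key, member)  # 1 if the link was unseen, 0 if a duplicate
    await r.expire(key, SEVEN_DAYS)    # refresh the 7-day window
    return added == 1
```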
### 3. Google Search
- Collects additional search results for each RSS item
- At most 3 results per item
- Processes at most 5 items per job
### 4. AI Summarizer
- Fast summaries with Claude Haiku
- Korean summaries of at most 200 characters
- Parallel processing (3 workers)
### 5. Article Assembly
- Writes a combined article with Claude Sonnet
- Professional articles of at most 1,500 characters
- Saves to MongoDB and updates statistics
### 6. Monitor
- Real-time pipeline monitoring
- Queue and worker status checks
- REST API (port 8100)
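All of the worker services above follow the same loop, condensed here into a sketch built on the shared `QueueManager` and `PipelineJob` from this commit (`handler` is a hypothetical stand-in for each stage's actual work):

```python
from shared.models import PipelineJob
from shared.queue_manager import QueueManager

async def run_stage(stage: str, next_stage: str, handler) -> None:
    """Block on this stage's queue, process each job, then hand off to the next queue."""
    qm = QueueManager(redis_url="redis://redis:6379")
    await qm.connect()
    while True:
        job = await qm.dequeue(stage, timeout=5)
        if not job:
            continue
        job.data = await handler(job.data)  # stage-specific work
        job.stages_completed.append(stage)
        job.stage = next_stage
        await qm.enqueue(next_stage, job)
        await qm.mark_completed(stage, job.job_id)
```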
## Getting Started
### 1. Configure Environment Variables
```bash
# Check the .env file
CLAUDE_API_KEY=your_claude_api_key
GOOGLE_API_KEY=your_google_api_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
### 2. Start the Services
```bash
cd pipeline
docker-compose up -d
```
### 3. Monitoring
```bash
# Check the logs
docker-compose logs -f
# Logs for a specific service
docker-compose logs -f scheduler
# Monitor API
curl http://localhost:8100/api/stats
```
## API Endpoints
### Monitor API (port 8100)
- `GET /api/stats` - overall statistics
- `GET /api/queues/{queue_name}` - queue details
- `GET /api/keywords` - list keywords
- `POST /api/keywords` - register a keyword
- `DELETE /api/keywords/{id}` - delete a keyword
- `GET /api/articles` - list articles
- `GET /api/articles/{id}` - article details
- `GET /api/workers` - worker status
- `POST /api/trigger/{keyword}` - manually trigger processing
- `GET /api/health` - health check
## Keyword Registration Example
```bash
# Register a new keyword
curl -X POST http://localhost:8100/api/keywords \
-H "Content-Type: application/json" \
-d '{"keyword": "인공지능", "schedule": "30min"}'
# Manually trigger processing
curl -X POST http://localhost:8100/api/trigger/인공지능
```
## Databases
### MongoDB Collections
- `keywords` - registered keywords
- `articles` - generated articles
- `keyword_stats` - per-keyword statistics
### Redis Keys
- `queue:*` - job queues
- `processing:*` - jobs in progress
- `failed:*` - failed jobs (see the requeue sketch below)
- `dedup:rss:*` - RSS deduplication
- `workers:*:active` - active workers
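For example, failed jobs could be pushed back onto their work queue with a small helper (a sketch assuming `failed:<stage>` and `queue:<stage>` are plain Redis lists, as the monitor's `llen` calls suggest):

```python
import redis.asyncio as redis

async def requeue_failed(r: redis.Redis, stage: str) -> int:
    """Move every job from failed:<stage> back onto queue:<stage>; return the count."""
    moved = 0
    while await r.rpoplpush(f"failed:{stage}", f"queue:{stage}"):
        moved += 1
    return moved
```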
## Troubleshooting
### Clear the Queues
```bash
docker-compose exec redis redis-cli FLUSHDB
```
### Restart a Worker
```bash
docker-compose restart rss-collector
```
### Access the Databases
```bash
# MongoDB
docker-compose exec mongodb mongosh -u admin -p password123
# Redis
docker-compose exec redis redis-cli
```
## Scaling
Adjust the number of workers:
```yaml
# docker-compose.yml
ai-summarizer:
deploy:
replicas: 5  # increase the worker count
```
## Monitoring Dashboard
Open http://localhost:8100 in a browser to check the pipeline status.
## Log Level
Adjust it in the `.env` file:
```
LOG_LEVEL=DEBUG # INFO, WARNING, ERROR
```


@@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./ai-summarizer/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the shared module
COPY ./shared /app/shared
# Copy the AI Summarizer code
COPY ./ai-summarizer /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Run
CMD ["python", "ai_summarizer.py"]


@@ -0,0 +1,161 @@
"""
AI Summarizer Service
News summarization service using the Claude API
"""
import asyncio
import logging
import os
import sys
from typing import List, Dict, Any
from anthropic import AsyncAnthropic
# Import from shared module
from shared.models import PipelineJob, EnrichedItem, SummarizedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class AISummarizerWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.claude_api_key = os.getenv("CLAUDE_API_KEY")
self.claude_client = None
async def start(self):
"""워커 시작"""
logger.info("Starting AI Summarizer Worker")
# Redis 연결
await self.queue_manager.connect()
# Claude 클라이언트 초기화
if self.claude_api_key:
self.claude_client = AsyncAnthropic(api_key=self.claude_api_key)
else:
logger.error("Claude API key not configured")
return
# Main processing loop
while True:
try:
# Pull a job from the queue
job = await self.queue_manager.dequeue('ai_summarization', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""AI 요약 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for AI summarization")
enriched_items = job.data.get('enriched_items', [])
summarized_items = []
for item_data in enriched_items:
enriched_item = EnrichedItem(**item_data)
# Generate the AI summary
summary = await self._generate_summary(enriched_item)
summarized_item = SummarizedItem(
enriched_item=enriched_item,
ai_summary=summary,
summary_language='ko'
)
summarized_items.append(summarized_item)
# API rate limiting
await asyncio.sleep(1)
if summarized_items:
logger.info(f"Summarized {len(summarized_items)} items")
# Hand off to the next stage (translation)
job.data['summarized_items'] = [item.dict() for item in summarized_items]
job.stages_completed.append('ai_summarization')
job.stage = 'translation'
await self.queue_manager.enqueue('translation', job)
await self.queue_manager.mark_completed('ai_summarization', job.job_id)
else:
logger.warning(f"No items summarized for job {job.job_id}")
await self.queue_manager.mark_failed(
'ai_summarization',
job,
"No items to summarize"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('ai_summarization', job, str(e))
async def _generate_summary(self, enriched_item: EnrichedItem) -> str:
"""Claude를 사용한 요약 생성"""
try:
# 컨텐츠 준비
content_parts = [
f"제목: {enriched_item.rss_item.title}",
f"요약: {enriched_item.rss_item.summary or '없음'}"
]
# Add the search results
if enriched_item.search_results:
content_parts.append("\n관련 검색 결과:")
for idx, result in enumerate(enriched_item.search_results[:3], 1):
content_parts.append(f"{idx}. {result.title}")
if result.snippet:
content_parts.append(f" {result.snippet}")
content = "\n".join(content_parts)
# Call the Claude API
prompt = f"""다음 뉴스 내용을 200자 이내로 핵심만 요약해주세요.
중요한 사실, 수치, 인물, 조직을 포함하고 객관적인 톤을 유지하세요.
{content}
요약:"""
response = await self.claude_client.messages.create(
model="claude-sonnet-4-20250514", # 최신 Sonnet 모델
max_tokens=500,
temperature=0.3,
messages=[
{"role": "user", "content": prompt}
]
)
summary = response.content[0].text.strip()
return summary
except Exception as e:
logger.error(f"Error generating summary: {e}")
# Fallback: use the original summary
return enriched_item.rss_item.summary[:200] if enriched_item.rss_item.summary else enriched_item.rss_item.title
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("AI Summarizer Worker stopped")
async def main():
"""메인 함수"""
worker = AISummarizerWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,3 @@
anthropic==0.50.0
redis[hiredis]==5.0.1
pydantic==2.5.0


@@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./article-assembly/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the shared module
COPY ./shared /app/shared
# Copy the Article Assembly code
COPY ./article-assembly /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Run
CMD ["python", "article_assembly.py"]


@@ -0,0 +1,234 @@
"""
Article Assembly Service
Final article assembly and MongoDB storage service
"""
import asyncio
import logging
import os
import sys
import json
from datetime import datetime
from typing import List, Dict, Any
from anthropic import AsyncAnthropic
from motor.motor_asyncio import AsyncIOMotorClient
# Import from shared module
from shared.models import PipelineJob, SummarizedItem, FinalArticle
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ArticleAssemblyWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.claude_api_key = os.getenv("CLAUDE_API_KEY")
self.claude_client = None
self.mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
self.db_name = os.getenv("DB_NAME", "pipeline_db")
self.db = None
async def start(self):
"""워커 시작"""
logger.info("Starting Article Assembly Worker")
# Redis 연결
await self.queue_manager.connect()
# MongoDB 연결
client = AsyncIOMotorClient(self.mongodb_url)
self.db = client[self.db_name]
# Claude 클라이언트 초기화
if self.claude_api_key:
self.claude_client = AsyncAnthropic(api_key=self.claude_api_key)
else:
logger.error("Claude API key not configured")
return
# Main processing loop
while True:
try:
# Pull a job from the queue
job = await self.queue_manager.dequeue('article_assembly', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""최종 기사 조립 작업 처리"""
try:
start_time = datetime.now()
logger.info(f"Processing job {job.job_id} for article assembly")
summarized_items = job.data.get('summarized_items', [])
if not summarized_items:
logger.warning(f"No items to assemble for job {job.job_id}")
await self.queue_manager.mark_failed(
'article_assembly',
job,
"No items to assemble"
)
return
# Generate the final article
article = await self._generate_final_article(job, summarized_items)
# Compute the processing time
processing_time = (datetime.now() - start_time).total_seconds()
article.processing_time = processing_time
# Save to MongoDB
await self.db.articles.insert_one(article.dict())
logger.info(f"Article {article.article_id} saved to MongoDB")
# Mark as completed
job.stages_completed.append('article_assembly')
await self.queue_manager.mark_completed('article_assembly', job.job_id)
# Update statistics
await self._update_statistics(job.keyword_id)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('article_assembly', job, str(e))
async def _generate_final_article(
self,
job: PipelineJob,
summarized_items: List[Dict]
) -> FinalArticle:
"""Claude를 사용한 최종 기사 생성"""
# 아이템 정보 준비
items_text = []
for idx, item_data in enumerate(summarized_items, 1):
item = SummarizedItem(**item_data)
items_text.append(f"""
[뉴스 {idx}]
제목: {item.enriched_item['rss_item']['title']}
요약: {item.ai_summary}
출처: {item.enriched_item['rss_item']['link']}
""")
content = "\n".join(items_text)
# Write the combined article with Claude
prompt = f"""다음 뉴스 항목들을 바탕으로 종합적인 기사를 작성해주세요.
키워드: {job.keyword}
뉴스 항목들:
{content}
다음 JSON 형식으로 작성해주세요:
{{
"title": "종합 기사 제목",
"content": "기사 본문 (1500자 이내, 문단 구분)",
"summary": "한 줄 요약 (100자 이내)",
"categories": ["카테고리1", "카테고리2"],
"tags": ["태그1", "태그2", "태그3"]
}}
요구사항:
- 전문적이고 객관적인 톤
- 핵심 정보와 트렌드 파악
- 시사점 포함
- 한국 독자 대상"""
try:
response = await self.claude_client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=3000,
temperature=0.7,
messages=[
{"role": "user", "content": prompt}
]
)
# Parse the JSON
content_text = response.content[0].text
json_start = content_text.find('{')
json_end = content_text.rfind('}') + 1
if json_start != -1 and json_end > json_start:
article_data = json.loads(content_text[json_start:json_end])
else:
raise ValueError("No valid JSON in response")
# Build the FinalArticle
article = FinalArticle(
job_id=job.job_id,
keyword_id=job.keyword_id,
keyword=job.keyword,
title=article_data.get('title', f"{job.keyword} 종합 뉴스"),
content=article_data.get('content', ''),
summary=article_data.get('summary', ''),
source_items=[],  # simplified
images=[],  # images are handled by a separate service
categories=article_data.get('categories', []),
tags=article_data.get('tags', []),
pipeline_stages=job.stages_completed,
processing_time=0  # updated later
)
return article
except Exception as e:
logger.error(f"Error generating article: {e}")
# Generate a fallback article
return FinalArticle(
job_id=job.job_id,
keyword_id=job.keyword_id,
keyword=job.keyword,
title=f"{job.keyword} 뉴스 요약 - {datetime.now().strftime('%Y-%m-%d')}",
content=content,
summary=f"{job.keyword} 관련 {len(summarized_items)}개 뉴스 요약",
source_items=[],
images=[],
categories=['자동생성'],
tags=[job.keyword],
pipeline_stages=job.stages_completed,
processing_time=0
)
async def _update_statistics(self, keyword_id: str):
"""키워드별 통계 업데이트"""
try:
await self.db.keyword_stats.update_one(
{"keyword_id": keyword_id},
{
"$inc": {"articles_generated": 1},
"$set": {"last_generated": datetime.now()}
},
upsert=True
)
except Exception as e:
logger.error(f"Error updating statistics: {e}")
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Article Assembly Worker stopped")
async def main():
"""메인 함수"""
worker = ArticleAssemblyWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,5 @@
anthropic==0.50.0
motor==3.1.1
pymongo==4.3.3
redis[hiredis]==5.0.1
pydantic==2.5.0


@@ -0,0 +1,62 @@
#!/usr/bin/env python3
"""Fix import statements in all pipeline services"""
import os
import re
def fix_imports(filepath):
"""Fix import statements in a Python file"""
with open(filepath, 'r') as f:
content = f.read()
# Pattern to match the old import style
old_pattern = r"# 상위 디렉토리의 shared 모듈 import\nsys\.path\.append\(os\.path\.join\(os\.path\.dirname\(__file__\), '\.\.', 'shared'\)\)\nfrom ([\w, ]+) import ([\w, ]+)"
# Replace with new import style
def replace_imports(match):
modules = match.group(1)
items = match.group(2)
# Build new import statements
imports = []
if 'models' in modules:
imports.append(f"from shared.models import {items}" if 'models' in modules else "")
if 'queue_manager' in modules:
imports.append(f"from shared.queue_manager import QueueManager")
return "# Import from shared module\n" + "\n".join(filter(None, imports))
# Apply the replacement
new_content = re.sub(old_pattern, replace_imports, content)
# Also handle simpler patterns
new_content = new_content.replace(
"sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'shared'))\nfrom models import",
"from shared.models import"
)
new_content = new_content.replace(
"\nfrom queue_manager import",
"\nfrom shared.queue_manager import"
)
# Write back if changed
if new_content != content:
with open(filepath, 'w') as f:
f.write(new_content)
print(f"Fixed imports in {filepath}")
return True
return False
# Files to fix
files_to_fix = [
"monitor/monitor.py",
"google-search/google_search.py",
"article-assembly/article_assembly.py",
"rss-collector/rss_collector.py",
"ai-summarizer/ai_summarizer.py"
]
for file_path in files_to_fix:
full_path = os.path.join(os.path.dirname(__file__), file_path)
if os.path.exists(full_path):
fix_imports(full_path)


@@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./google-search/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the shared module
COPY ./shared /app/shared
# Copy the Google Search code
COPY ./google-search /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Run
CMD ["python", "google_search.py"]


@@ -0,0 +1,153 @@
"""
Google Search Service
Enriches RSS items with Google search results
"""
import asyncio
import logging
import os
import sys
import json
from typing import List, Dict, Any
import aiohttp
from datetime import datetime
# Import from shared module
from shared.models import PipelineJob, RSSItem, SearchResult, EnrichedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class GoogleSearchWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.google_api_key = os.getenv("GOOGLE_API_KEY")
self.search_engine_id = os.getenv("GOOGLE_SEARCH_ENGINE_ID")
self.max_results_per_item = 3
async def start(self):
"""워커 시작"""
logger.info("Starting Google Search Worker")
# Redis 연결
await self.queue_manager.connect()
# Main processing loop
while True:
try:
# Pull a job from the queue
job = await self.queue_manager.dequeue('search_enrichment', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""검색 강화 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for search enrichment")
rss_items = job.data.get('rss_items', [])
enriched_items = []
# Process at most 5 items (to manage the API quota)
for item_data in rss_items[:5]:
rss_item = RSSItem(**item_data)
# Search Google with the item title
search_results = await self._search_google(rss_item.title)
enriched_item = EnrichedItem(
rss_item=rss_item,
search_results=search_results
)
enriched_items.append(enriched_item)
# API rate limiting
await asyncio.sleep(0.5)
if enriched_items:
logger.info(f"Enriched {len(enriched_items)} items with search results")
# Hand off to the next stage
job.data['enriched_items'] = [item.dict() for item in enriched_items]
job.stages_completed.append('search_enrichment')
job.stage = 'ai_summarization'
await self.queue_manager.enqueue('ai_summarization', job)
await self.queue_manager.mark_completed('search_enrichment', job.job_id)
else:
logger.warning(f"No items enriched for job {job.job_id}")
await self.queue_manager.mark_failed(
'search_enrichment',
job,
"No items to enrich"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('search_enrichment', job, str(e))
async def _search_google(self, query: str) -> List[SearchResult]:
"""Google Custom Search API 호출"""
results = []
if not self.google_api_key or not self.search_engine_id:
logger.warning("Google API credentials not configured")
return results
try:
url = "https://www.googleapis.com/customsearch/v1"
params = {
"key": self.google_api_key,
"cx": self.search_engine_id,
"q": query,
"num": self.max_results_per_item,
"hl": "ko",
"gl": "kr"
}
async with aiohttp.ClientSession() as session:
async with session.get(url, params=params, timeout=30) as response:
if response.status == 200:
data = await response.json()
for item in data.get('items', []):
result = SearchResult(
title=item.get('title', ''),
link=item.get('link', ''),
snippet=item.get('snippet', ''),
source='google'
)
results.append(result)
else:
logger.error(f"Google API error: {response.status}")
except Exception as e:
logger.error(f"Error searching Google for '{query}': {e}")
return results
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Google Search Worker stopped")
async def main():
"""메인 함수"""
worker = GoogleSearchWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,3 @@
aiohttp==3.9.1
redis[hiredis]==5.0.1
pydantic==2.5.0


@@ -0,0 +1,15 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./image-generator/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy application code
COPY ./image-generator /app
CMD ["python", "image_generator.py"]


@@ -0,0 +1,225 @@
"""
Image Generation Service
Image generation service using the Replicate API
"""
import asyncio
import logging
import os
import sys
import base64
from typing import List, Dict, Any
import httpx
from io import BytesIO
# Import from shared module
from shared.models import PipelineJob, TranslatedItem, GeneratedImageItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ImageGeneratorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.replicate_api_key = os.getenv("REPLICATE_API_KEY")
self.replicate_api_url = "https://api.replicate.com/v1/predictions"
# Uses a Stable Diffusion model
self.model_version = "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b"
async def start(self):
"""워커 시작"""
logger.info("Starting Image Generator Worker")
# Redis 연결
await self.queue_manager.connect()
# API 키 확인
if not self.replicate_api_key:
logger.warning("Replicate API key not configured - using placeholder images")
# Main processing loop
while True:
try:
# Pull a job from the queue
job = await self.queue_manager.dequeue('image_generation', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""이미지 생성 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for image generation")
translated_items = job.data.get('translated_items', [])
generated_items = []
# Generate images for at most 3 items (to limit API cost)
for idx, item_data in enumerate(translated_items[:3]):
translated_item = TranslatedItem(**item_data)
# Build the prompt for image generation
prompt = self._create_image_prompt(translated_item)
# Generate the image
image_url = await self._generate_image(prompt)
generated_item = GeneratedImageItem(
translated_item=translated_item,
image_url=image_url,
image_prompt=prompt
)
generated_items.append(generated_item)
# API rate limiting
if self.replicate_api_key:
await asyncio.sleep(2)
if generated_items:
logger.info(f"Generated images for {len(generated_items)} items")
# Store the completed data on the job
job.data['generated_items'] = [item.dict() for item in generated_items]
job.stages_completed.append('image_generation')
job.stage = 'completed'
# Hand off to the final article assembly stage
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
else:
logger.warning(f"No images generated for job {job.job_id}")
# Continue to the next stage even if image generation failed
job.stages_completed.append('image_generation')
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
# Continue to the next stage even if image generation failed
job.stages_completed.append('image_generation')
await self.queue_manager.enqueue('article_assembly', job)
await self.queue_manager.mark_completed('image_generation', job.job_id)
def _create_image_prompt(self, translated_item: TranslatedItem) -> str:
"""이미지 생성을 위한 프롬프트 생성"""
# 영문 제목과 요약을 기반으로 프롬프트 생성
title = translated_item.translated_title or translated_item.summarized_item['enriched_item']['rss_item']['title']
summary = translated_item.translated_summary or translated_item.summarized_item['ai_summary']
# Prompt for a news-style image
prompt = f"News illustration for: {title[:100]}, professional, photorealistic, high quality, 4k"
return prompt
async def _generate_image(self, prompt: str) -> str:
"""Replicate API를 사용한 이미지 생성"""
try:
if not self.replicate_api_key:
# API 키가 없으면 플레이스홀더 이미지 URL 반환
return "https://via.placeholder.com/800x600.png?text=News+Image"
async with httpx.AsyncClient() as client:
# Create the prediction request
response = await client.post(
self.replicate_api_url,
headers={
"Authorization": f"Token {self.replicate_api_key}",
"Content-Type": "application/json"
},
json={
"version": self.model_version,
"input": {
"prompt": prompt,
"width": 768,
"height": 768,
"num_outputs": 1,
"scheduler": "K_EULER",
"num_inference_steps": 25,
"guidance_scale": 7.5,
"prompt_strength": 0.8,
"refine": "expert_ensemble_refiner",
"high_noise_frac": 0.8
}
},
timeout=60
)
if response.status_code in [200, 201]:
result = response.json()
prediction_id = result.get('id')
# Poll for the prediction result
image_url = await self._poll_prediction(prediction_id)
return image_url
else:
logger.error(f"Replicate API error: {response.status_code}")
return "https://via.placeholder.com/800x600.png?text=Generation+Failed"
except Exception as e:
logger.error(f"Error generating image: {e}")
return "https://via.placeholder.com/800x600.png?text=Error"
async def _poll_prediction(self, prediction_id: str, max_attempts: int = 30) -> str:
"""예측 결과 폴링"""
try:
async with httpx.AsyncClient() as client:
for attempt in range(max_attempts):
response = await client.get(
f"{self.replicate_api_url}/{prediction_id}",
headers={
"Authorization": f"Token {self.replicate_api_key}"
},
timeout=30
)
if response.status_code == 200:
result = response.json()
status = result.get('status')
if status == 'succeeded':
output = result.get('output')
if output and isinstance(output, list) and len(output) > 0:
return output[0]
else:
return "https://via.placeholder.com/800x600.png?text=No+Output"
elif status == 'failed':
logger.error(f"Prediction failed: {result.get('error')}")
return "https://via.placeholder.com/800x600.png?text=Failed"
# Still processing; wait and retry
await asyncio.sleep(2)
else:
logger.error(f"Error polling prediction: {response.status_code}")
return "https://via.placeholder.com/800x600.png?text=Poll+Error"
# Exceeded the maximum number of attempts
return "https://via.placeholder.com/800x600.png?text=Timeout"
except Exception as e:
logger.error(f"Error polling prediction: {e}")
return "https://via.placeholder.com/800x600.png?text=Poll+Exception"
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Image Generator Worker stopped")
async def main():
"""메인 함수"""
worker = ImageGeneratorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@@ -0,0 +1,3 @@
httpx==0.25.0
redis[hiredis]==5.0.1
pydantic==2.5.0


@@ -0,0 +1,22 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./monitor/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy monitor code
COPY ./monitor /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Expose port
EXPOSE 8000
# Run
CMD ["uvicorn", "monitor:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]


@@ -0,0 +1,349 @@
"""
Pipeline Monitor Service
Pipeline status monitoring and dashboard API
"""
import os
import sys
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Any
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from motor.motor_asyncio import AsyncIOMotorClient
import redis.asyncio as redis
from bson import ObjectId  # convert string ids when querying MongoDB _id fields
# Import from shared module
from shared.models import KeywordSubscription, PipelineJob, FinalArticle
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="Pipeline Monitor", version="1.0.0")
# CORS configuration
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Global connections
redis_client = None
mongodb_client = None
db = None
@app.on_event("startup")
async def startup_event():
"""서버 시작 시 연결 초기화"""
global redis_client, mongodb_client, db
# Redis 연결
redis_url = os.getenv("REDIS_URL", "redis://redis:6379")
redis_client = await redis.from_url(redis_url, decode_responses=True)
# MongoDB 연결
mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
mongodb_client = AsyncIOMotorClient(mongodb_url)
db = mongodb_client[os.getenv("DB_NAME", "pipeline_db")]
logger.info("Pipeline Monitor started successfully")
@app.on_event("shutdown")
async def shutdown_event():
"""서버 종료 시 연결 해제"""
if redis_client:
await redis_client.close()
if mongodb_client:
mongodb_client.close()
@app.get("/")
async def root():
"""헬스 체크"""
return {"status": "Pipeline Monitor is running"}
@app.get("/api/stats")
async def get_stats():
"""전체 파이프라인 통계"""
try:
# 큐별 대기 작업 수
queue_stats = {}
queues = [
"queue:keyword",
"queue:rss",
"queue:search",
"queue:summarize",
"queue:assembly"
]
for queue in queues:
length = await redis_client.llen(queue)
queue_stats[queue] = length
# Articles generated today
today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
articles_today = await db.articles.count_documents({
"created_at": {"$gte": today}
})
# Number of active keywords
active_keywords = await db.keywords.count_documents({
"is_active": True
})
# Total number of articles
total_articles = await db.articles.count_documents({})
return {
"queues": queue_stats,
"articles_today": articles_today,
"active_keywords": active_keywords,
"total_articles": total_articles,
"timestamp": datetime.now().isoformat()
}
except Exception as e:
logger.error(f"Error getting stats: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/queues/{queue_name}")
async def get_queue_details(queue_name: str):
"""특정 큐의 상세 정보"""
try:
queue_key = f"queue:{queue_name}"
# Queue length
length = await redis_client.llen(queue_key)
# Preview the 10 most recent jobs
items = await redis_client.lrange(queue_key, 0, 9)
# Jobs currently being processed
processing_key = f"processing:{queue_name}"
processing = await redis_client.smembers(processing_key)
# Failed jobs
failed_key = f"failed:{queue_name}"
failed_count = await redis_client.llen(failed_key)
return {
"queue": queue_name,
"length": length,
"processing_count": len(processing),
"failed_count": failed_count,
"preview": items[:10],
"timestamp": datetime.now().isoformat()
}
except Exception as e:
logger.error(f"Error getting queue details: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/keywords")
async def get_keywords():
"""등록된 키워드 목록"""
try:
keywords = []
cursor = db.keywords.find({"is_active": True})
async for keyword in cursor:
# The most recent article for this keyword
latest_article = await db.articles.find_one(
{"keyword_id": str(keyword["_id"])},
sort=[("created_at", -1)]
)
keywords.append({
"id": str(keyword["_id"]),
"keyword": keyword["keyword"],
"schedule": keyword.get("schedule", "30분마다"),
"created_at": keyword.get("created_at"),
"last_article": latest_article["created_at"] if latest_article else None,
"article_count": await db.articles.count_documents(
{"keyword_id": str(keyword["_id"])}
)
})
return keywords
except Exception as e:
logger.error(f"Error getting keywords: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/api/keywords")
async def add_keyword(keyword: str, schedule: str = "30min"):
"""새 키워드 등록"""
try:
new_keyword = {
"keyword": keyword,
"schedule": schedule,
"is_active": True,
"created_at": datetime.now(),
"updated_at": datetime.now()
}
result = await db.keywords.insert_one(new_keyword)
return {
"id": str(result.inserted_id),
"keyword": keyword,
"message": "Keyword registered successfully"
}
except Exception as e:
logger.error(f"Error adding keyword: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.delete("/api/keywords/{keyword_id}")
async def delete_keyword(keyword_id: str):
"""키워드 비활성화"""
try:
result = await db.keywords.update_one(
{"_id": keyword_id},
{"$set": {"is_active": False, "updated_at": datetime.now()}}
)
if result.modified_count > 0:
return {"message": "Keyword deactivated successfully"}
else:
raise HTTPException(status_code=404, detail="Keyword not found")
except Exception as e:
logger.error(f"Error deleting keyword: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/articles")
async def get_articles(limit: int = 10, skip: int = 0):
"""최근 생성된 기사 목록"""
try:
articles = []
cursor = db.articles.find().sort("created_at", -1).skip(skip).limit(limit)
async for article in cursor:
articles.append({
"id": str(article["_id"]),
"title": article["title"],
"keyword": article["keyword"],
"summary": article.get("summary", ""),
"created_at": article["created_at"],
"processing_time": article.get("processing_time", 0),
"pipeline_stages": article.get("pipeline_stages", [])
})
total = await db.articles.count_documents({})
return {
"articles": articles,
"total": total,
"limit": limit,
"skip": skip
}
except Exception as e:
logger.error(f"Error getting articles: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/articles/{article_id}")
async def get_article(article_id: str):
"""특정 기사 상세 정보"""
try:
article = await db.articles.find_one({"_id": article_id})
if not article:
raise HTTPException(status_code=404, detail="Article not found")
return article
except Exception as e:
logger.error(f"Error getting article: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/workers")
async def get_workers():
"""워커 상태 정보"""
try:
workers = {}
worker_types = [
"scheduler",
"rss_collector",
"google_search",
"ai_summarizer",
"article_assembly"
]
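        # NOTE: translator and image_generator workers are not tracked here yet;
        # add them once those services register under workers:<type>:active.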
for worker_type in worker_types:
active_key = f"workers:{worker_type}:active"
active_workers = await redis_client.smembers(active_key)
workers[worker_type] = {
"active": len(active_workers),
"worker_ids": list(active_workers)
}
return workers
except Exception as e:
logger.error(f"Error getting workers: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/api/trigger/{keyword}")
async def trigger_keyword_processing(keyword: str):
"""수동으로 키워드 처리 트리거"""
try:
# 키워드 찾기
keyword_doc = await db.keywords.find_one({
"keyword": keyword,
"is_active": True
})
if not keyword_doc:
raise HTTPException(status_code=404, detail="Keyword not found or inactive")
        # Create the job
job = PipelineJob(
keyword_id=str(keyword_doc["_id"]),
keyword=keyword,
stage="keyword_processing",
created_at=datetime.now()
)
        # Add to the queue
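        # NOTE (assumption): workers that consume via QueueManager expect a QueueMessage
        # wrapper, so enqueuing through QueueManager would be safer than a raw rpush:
        #   await queue_manager.enqueue("keyword_processing", job)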
await redis_client.rpush("queue:keyword", job.json())
return {
"message": f"Processing triggered for keyword: {keyword}",
"job_id": job.job_id
}
    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error triggering keyword: {e}")
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/health")
async def health_check():
"""시스템 헬스 체크"""
try:
# Redis 체크
redis_status = await redis_client.ping()
# MongoDB 체크
mongodb_status = await db.command("ping")
return {
"status": "healthy",
"redis": "connected" if redis_status else "disconnected",
"mongodb": "connected" if mongodb_status else "disconnected",
"timestamp": datetime.now().isoformat()
}
except Exception as e:
return {
"status": "unhealthy",
"error": str(e),
"timestamp": datetime.now().isoformat()
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)


@ -0,0 +1,6 @@
fastapi==0.104.1
uvicorn[standard]==0.24.0
redis[hiredis]==5.0.1
motor==3.1.1
pymongo==4.3.3
pydantic==2.5.0


@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./rss-collector/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy RSS Collector code
COPY ./rss-collector /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Run
CMD ["python", "rss_collector.py"]


@ -0,0 +1,4 @@
feedparser==6.0.11
aiohttp==3.9.1
redis[hiredis]==5.0.1
pydantic==2.5.0


@ -0,0 +1,192 @@
"""
RSS Collector Service
Collects RSS feeds and removes duplicate items
"""
import asyncio
import logging
import os
import sys
import hashlib
from datetime import datetime
import feedparser
import aiohttp
import redis.asyncio as redis
from typing import List, Dict, Any
# Import from shared module
from shared.models import PipelineJob, RSSItem, EnrichedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RSSCollectorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.redis_client = None
self.redis_url = os.getenv("REDIS_URL", "redis://redis:6379")
        self.dedup_ttl = 86400 * 7  # suppress duplicates for 7 days
        self.max_items_per_feed = 10  # maximum items per feed
async def start(self):
"""워커 시작"""
logger.info("Starting RSS Collector Worker")
# Redis 연결
await self.queue_manager.connect()
self.redis_client = await redis.from_url(
self.redis_url,
encoding="utf-8",
decode_responses=True
)
        # Main processing loop
        while True:
            try:
                # Fetch a job from the queue (wait up to 5 seconds)
job = await self.queue_manager.dequeue('rss_collection', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""RSS 수집 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for keyword '{job.keyword}'")
keyword = job.data.get('keyword', '')
rss_feeds = job.data.get('rss_feeds', [])
            # Build feed URLs that include the keyword
processed_feeds = self._prepare_feeds(rss_feeds, keyword)
all_items = []
for feed_url in processed_feeds:
try:
items = await self._fetch_rss_feed(feed_url, keyword)
all_items.extend(items)
except Exception as e:
logger.error(f"Error fetching feed {feed_url}: {e}")
if all_items:
                # Remove duplicates
unique_items = await self._deduplicate_items(all_items, keyword)
if unique_items:
logger.info(f"Collected {len(unique_items)} unique items for '{keyword}'")
                    # Hand off to the next stage
job.data['rss_items'] = [item.dict() for item in unique_items]
job.stages_completed.append('rss_collection')
job.stage = 'search_enrichment'
await self.queue_manager.enqueue('search_enrichment', job)
await self.queue_manager.mark_completed('rss_collection', job.job_id)
else:
logger.info(f"No new items found for '{keyword}'")
await self.queue_manager.mark_completed('rss_collection', job.job_id)
else:
logger.warning(f"No RSS items collected for '{keyword}'")
await self.queue_manager.mark_failed(
'rss_collection',
job,
"No RSS items collected"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('rss_collection', job, str(e))
def _prepare_feeds(self, feeds: List[str], keyword: str) -> List[str]:
"""RSS 피드 URL 준비 (키워드 치환)"""
processed = []
for feed in feeds:
if '{keyword}' in feed:
processed.append(feed.replace('{keyword}', keyword))
else:
processed.append(feed)
return processed
async def _fetch_rss_feed(self, feed_url: str, keyword: str) -> List[RSSItem]:
"""RSS 피드 가져오기"""
items = []
try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(feed_url) as response:
                    content = await response.text()
                    # Parse with feedparser
                    feed = feedparser.parse(content)
                    for entry in feed.entries[:self.max_items_per_feed]:
                        # Relevance check against the keyword
                        title = entry.get('title', '')
                        summary = entry.get('summary', '')
                        # Keep only entries whose title or summary contains the keyword
                        if keyword.lower() in title.lower() or keyword.lower() in summary.lower():
item = RSSItem(
title=title,
link=entry.get('link', ''),
published=entry.get('published', ''),
summary=summary[:500] if summary else '',
source_feed=feed_url
)
items.append(item)
except Exception as e:
logger.error(f"Error fetching RSS feed {feed_url}: {e}")
return items
async def _deduplicate_items(self, items: List[RSSItem], keyword: str) -> List[RSSItem]:
"""중복 항목 제거"""
unique_items = []
dedup_key = f"dedup:{keyword}"
for item in items:
            # Hash the keyword + title
item_hash = hashlib.md5(
f"{keyword}:{item.title}".encode()
).hexdigest()
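            # NOTE: hashing keyword+title means identical titles from different feeds
            # are collapsed; hashing item.link instead would dedupe per URL.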
            # Check for duplicates via a Redis set (SADD returns 1 for new members)
is_new = await self.redis_client.sadd(dedup_key, item_hash)
if is_new:
unique_items.append(item)
        # Set the TTL
if unique_items:
await self.redis_client.expire(dedup_key, self.dedup_ttl)
return unique_items
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
if self.redis_client:
await self.redis_client.close()
logger.info("RSS Collector Worker stopped")
async def main():
"""메인 함수"""
worker = RSSCollectorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@ -0,0 +1,19 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./scheduler/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy scheduler code
COPY ./scheduler /app
# Environment variables
ENV PYTHONUNBUFFERED=1
# Run
CMD ["python", "scheduler.py"]


@ -0,0 +1,5 @@
apscheduler==3.10.4
motor==3.1.1
pymongo==4.3.3
redis[hiredis]==5.0.1
pydantic==2.5.0


@ -0,0 +1,203 @@
"""
News Pipeline Scheduler
Scheduler service for the news pipeline
"""
import asyncio
import logging
import os
import sys
from datetime import datetime, timedelta
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from motor.motor_asyncio import AsyncIOMotorClient
# Import from shared module
from shared.models import KeywordSubscription, PipelineJob
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class NewsScheduler:
def __init__(self):
self.scheduler = AsyncIOScheduler()
self.mongodb_url = os.getenv("MONGODB_URL", "mongodb://mongodb:27017")
self.db_name = os.getenv("DB_NAME", "pipeline_db")
self.db = None
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
async def start(self):
"""스케줄러 시작"""
logger.info("Starting News Pipeline Scheduler")
# MongoDB 연결
client = AsyncIOMotorClient(self.mongodb_url)
self.db = client[self.db_name]
# Redis 연결
await self.queue_manager.connect()
# 기본 스케줄 설정
# 매 30분마다 실행
self.scheduler.add_job(
self.process_keywords,
'interval',
minutes=30,
id='keyword_processor',
name='Process Active Keywords'
)
        # Extra runs at peak hours (7 AM, noon, 6 PM)
for hour in [7, 12, 18]:
self.scheduler.add_job(
self.process_priority_keywords,
'cron',
hour=hour,
minute=0,
id=f'priority_processor_{hour}',
name=f'Process Priority Keywords at {hour}:00'
)
        # Reset daily statistics at midnight
self.scheduler.add_job(
self.reset_daily_stats,
'cron',
hour=0,
minute=0,
id='stats_reset',
name='Reset Daily Statistics'
)
self.scheduler.start()
logger.info("Scheduler started successfully")
        # Run once immediately on startup
await self.process_keywords()
async def process_keywords(self):
"""활성 키워드 처리"""
try:
logger.info("Processing active keywords")
# MongoDB에서 활성 키워드 로드
now = datetime.now()
thirty_minutes_ago = now - timedelta(minutes=30)
keywords = await self.db.keywords.find({
"is_active": True,
"$or": [
{"last_processed": {"$lt": thirty_minutes_ago}},
{"last_processed": None}
]
}).to_list(None)
logger.info(f"Found {len(keywords)} keywords to process")
for keyword_doc in keywords:
await self._create_job(keyword_doc)
                # Update the processing timestamp (match on _id, which every document has)
                await self.db.keywords.update_one(
                    {"_id": keyword_doc["_id"]},
                    {"$set": {"last_processed": now}}
                )
logger.info(f"Created jobs for {len(keywords)} keywords")
except Exception as e:
logger.error(f"Error processing keywords: {e}")
async def process_priority_keywords(self):
"""우선순위 키워드 처리"""
try:
logger.info("Processing priority keywords")
keywords = await self.db.keywords.find({
"is_active": True,
"is_priority": True
}).to_list(None)
for keyword_doc in keywords:
await self._create_job(keyword_doc, priority=1)
logger.info(f"Created priority jobs for {len(keywords)} keywords")
except Exception as e:
logger.error(f"Error processing priority keywords: {e}")
async def _create_job(self, keyword_doc: dict, priority: int = 0):
"""파이프라인 작업 생성"""
try:
# KeywordSubscription 모델로 변환
keyword = KeywordSubscription(**keyword_doc)
# PipelineJob 생성
job = PipelineJob(
keyword_id=keyword.keyword_id,
keyword=keyword.keyword,
stage='rss_collection',
stages_completed=[],
priority=priority,
data={
'keyword': keyword.keyword,
'language': keyword.language,
'rss_feeds': keyword.rss_feeds or self._get_default_rss_feeds(),
'categories': keyword.categories
}
)
            # Add to the first queue
await self.queue_manager.enqueue(
'rss_collection',
job,
priority=priority
)
logger.info(f"Created job {job.job_id} for keyword '{keyword.keyword}'")
except Exception as e:
logger.error(f"Error creating job for keyword: {e}")
def _get_default_rss_feeds(self) -> list:
"""기본 RSS 피드 목록"""
return [
"https://news.google.com/rss/search?q={keyword}&hl=ko&gl=KR&ceid=KR:ko",
"https://trends.google.com/trends/trendingsearches/daily/rss?geo=KR",
"https://www.mk.co.kr/rss/40300001/", # 매일경제
"https://www.hankyung.com/feed/all-news", # 한국경제
"https://www.zdnet.co.kr/news/news_rss.xml", # ZDNet Korea
]
async def reset_daily_stats(self):
"""일일 통계 초기화"""
try:
logger.info("Resetting daily statistics")
# Redis 통계 초기화
# 구현 필요
pass
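            # A minimal sketch (assumption: the daily counters live in these hashes):
            # await self.queue_manager.redis_client.delete(
            #     "stats:queues", "stats:completed", "stats:failed"
            # )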
except Exception as e:
logger.error(f"Error resetting stats: {e}")
async def stop(self):
"""스케줄러 중지"""
self.scheduler.shutdown()
await self.queue_manager.disconnect()
logger.info("Scheduler stopped")
async def main():
"""메인 함수"""
scheduler = NewsScheduler()
try:
await scheduler.start()
        # Keep running
while True:
await asyncio.sleep(60)
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await scheduler.stop()
if __name__ == "__main__":
asyncio.run(main())


@ -0,0 +1 @@
# Shared modules for pipeline services


@ -0,0 +1,113 @@
"""
Pipeline Data Models
Common data models used across the pipeline
"""
from datetime import datetime
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field
import uuid
class KeywordSubscription(BaseModel):
"""키워드 구독 모델"""
keyword_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
keyword: str
language: str = "ko"
schedule: str = "0 */30 * * *" # Cron expression (30분마다)
is_active: bool = True
is_priority: bool = False
last_processed: Optional[datetime] = None
rss_feeds: List[str] = Field(default_factory=list)
categories: List[str] = Field(default_factory=list)
created_at: datetime = Field(default_factory=datetime.now)
owner: Optional[str] = None
class PipelineJob(BaseModel):
"""파이프라인 작업 모델"""
job_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
keyword_id: str
keyword: str
stage: str # current stage
stages_completed: List[str] = Field(default_factory=list)
data: Dict[str, Any] = Field(default_factory=dict)
retry_count: int = 0
max_retries: int = 3
priority: int = 0
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
class RSSItem(BaseModel):
"""RSS 피드 아이템"""
item_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
title: str
link: str
published: Optional[str] = None
summary: Optional[str] = None
source_feed: str
class SearchResult(BaseModel):
"""검색 결과"""
title: str
link: str
snippet: Optional[str] = None
source: str = "google"
class EnrichedItem(BaseModel):
"""강화된 뉴스 아이템"""
rss_item: RSSItem
search_results: List[SearchResult] = Field(default_factory=list)
class SummarizedItem(BaseModel):
"""요약된 아이템"""
enriched_item: EnrichedItem
ai_summary: str
summary_language: str = "ko"
class TranslatedItem(BaseModel):
    """Translated item"""
    summarized_item: SummarizedItem
    translated_title: str
    translated_summary: str
    target_language: str = "en"
class ItemWithImage(BaseModel):
"""이미지가 추가된 아이템"""
translated_item: TranslatedItem
image_url: str
image_prompt: str
class FinalArticle(BaseModel):
"""최종 기사"""
article_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
job_id: str
keyword_id: str
keyword: str
title: str
content: str
summary: str
source_items: List[ItemWithImage]
images: List[str]
categories: List[str] = Field(default_factory=list)
tags: List[str] = Field(default_factory=list)
created_at: datetime = Field(default_factory=datetime.now)
pipeline_stages: List[str]
processing_time: float # seconds
class GeneratedImageItem(BaseModel):
"""이미지 생성된 아이템"""
translated_item: Dict[str, Any] # TranslatedItem as dict
image_url: str
image_prompt: str
class QueueMessage(BaseModel):
"""큐 메시지"""
message_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
queue_name: str
job: PipelineJob
timestamp: datetime = Field(default_factory=datetime.now)
retry_count: int = 0
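# Stage flow through the pipeline (mirrors QueueManager.QUEUES):
#   rss_collection -> search_enrichment -> ai_summarization
#   -> translation -> image_generation -> article_assembly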


@ -0,0 +1,173 @@
"""
Queue Manager
Redis-based queue management system
"""
import redis.asyncio as redis
import json
import logging
from typing import Optional, Dict, Any, List
from datetime import datetime
from .models import PipelineJob, QueueMessage
logger = logging.getLogger(__name__)
class QueueManager:
"""Redis 기반 큐 매니저"""
QUEUES = {
"keyword_processing": "queue:keyword",
"rss_collection": "queue:rss",
"search_enrichment": "queue:search",
"ai_summarization": "queue:summarize",
"translation": "queue:translate",
"image_generation": "queue:image",
"article_assembly": "queue:assembly",
"failed": "queue:failed",
"scheduled": "queue:scheduled"
}
def __init__(self, redis_url: str = "redis://redis:6379"):
self.redis_url = redis_url
self.redis_client: Optional[redis.Redis] = None
async def connect(self):
"""Redis 연결"""
if not self.redis_client:
self.redis_client = await redis.from_url(
self.redis_url,
encoding="utf-8",
decode_responses=True
)
logger.info("Connected to Redis")
async def disconnect(self):
"""Redis 연결 해제"""
if self.redis_client:
await self.redis_client.close()
self.redis_client = None
async def enqueue(self, queue_name: str, job: PipelineJob, priority: int = 0) -> str:
"""작업을 큐에 추가"""
try:
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
message = QueueMessage(
queue_name=queue_name,
job=job
)
# 우선순위에 따라 추가
if priority > 0:
await self.redis_client.lpush(queue_key, message.json())
else:
await self.redis_client.rpush(queue_key, message.json())
# 통계 업데이트
await self.redis_client.hincrby("stats:queues", queue_name, 1)
logger.info(f"Job {job.job_id} enqueued to {queue_name}")
return job.job_id
except Exception as e:
logger.error(f"Failed to enqueue job: {e}")
raise
async def dequeue(self, queue_name: str, timeout: int = 0) -> Optional[PipelineJob]:
"""큐에서 작업 가져오기"""
try:
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
if timeout > 0:
result = await self.redis_client.blpop(queue_key, timeout=timeout)
if result:
_, data = result
else:
return None
else:
data = await self.redis_client.lpop(queue_key)
if data:
message = QueueMessage.parse_raw(data)
                # Track the job as in-progress
processing_key = f"processing:{queue_name}"
await self.redis_client.hset(
processing_key,
message.job.job_id,
message.json()
)
return message.job
return None
except Exception as e:
logger.error(f"Failed to dequeue job: {e}")
return None
async def mark_completed(self, queue_name: str, job_id: str):
"""작업 완료 표시"""
try:
processing_key = f"processing:{queue_name}"
await self.redis_client.hdel(processing_key, job_id)
# 통계 업데이트
await self.redis_client.hincrby("stats:completed", queue_name, 1)
logger.info(f"Job {job_id} completed in {queue_name}")
except Exception as e:
logger.error(f"Failed to mark job as completed: {e}")
async def mark_failed(self, queue_name: str, job: PipelineJob, error: str):
"""작업 실패 처리"""
try:
processing_key = f"processing:{queue_name}"
await self.redis_client.hdel(processing_key, job.job_id)
# 재시도 확인
if job.retry_count < job.max_retries:
job.retry_count += 1
await self.enqueue(queue_name, job)
logger.info(f"Job {job.job_id} requeued (retry {job.retry_count}/{job.max_retries})")
else:
# 실패 큐로 이동
job.data["error"] = error
job.data["failed_stage"] = queue_name
await self.enqueue("failed", job)
# 통계 업데이트
await self.redis_client.hincrby("stats:failed", queue_name, 1)
logger.error(f"Job {job.job_id} failed: {error}")
except Exception as e:
logger.error(f"Failed to mark job as failed: {e}")
async def get_queue_stats(self) -> Dict[str, Any]:
"""큐 통계 조회"""
try:
stats = {}
for name, key in self.QUEUES.items():
stats[name] = {
"pending": await self.redis_client.llen(key),
"processing": await self.redis_client.hlen(f"processing:{name}"),
}
            # Completed/failed statistics
stats["completed"] = await self.redis_client.hgetall("stats:completed") or {}
stats["failed"] = await self.redis_client.hgetall("stats:failed") or {}
return stats
except Exception as e:
logger.error(f"Failed to get queue stats: {e}")
return {}
async def clear_queue(self, queue_name: str):
"""큐 초기화 (테스트용)"""
queue_key = self.QUEUES.get(queue_name, f"queue:{queue_name}")
await self.redis_client.delete(queue_key)
await self.redis_client.delete(f"processing:{queue_name}")
logger.info(f"Queue {queue_name} cleared")


@ -0,0 +1,5 @@
redis[hiredis]==5.0.1
motor==3.1.1
pymongo==4.3.3
pydantic==2.5.0
python-dateutil==2.8.2


@ -0,0 +1,15 @@
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY ./translator/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY ./shared /app/shared
# Copy application code
COPY ./translator /app
CMD ["python", "translator.py"]


@ -0,0 +1,3 @@
httpx==0.25.0
redis[hiredis]==5.0.1
pydantic==2.5.0


@ -0,0 +1,154 @@
"""
Translation Service
Translation service backed by the DeepL API
"""
import asyncio
import logging
import os
import sys
from typing import List, Dict, Any
import httpx
# Import from shared module
from shared.models import PipelineJob, SummarizedItem, TranslatedItem
from shared.queue_manager import QueueManager
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class TranslatorWorker:
def __init__(self):
self.queue_manager = QueueManager(
redis_url=os.getenv("REDIS_URL", "redis://redis:6379")
)
self.deepl_api_key = os.getenv("DEEPL_API_KEY", "3abbc796-2515-44a8-972d-22dcf27ab54a")
        # Use the DeepL Pro API endpoint
self.deepl_api_url = "https://api.deepl.com/v2/translate"
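        # NOTE: free-tier keys use "https://api-free.deepl.com/v2/translate" instead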
async def start(self):
"""워커 시작"""
logger.info("Starting Translator Worker")
# Redis 연결
await self.queue_manager.connect()
# DeepL API 키 확인
if not self.deepl_api_key:
logger.error("DeepL API key not configured")
return
# 메인 처리 루프
while True:
try:
                # Fetch a job from the queue
job = await self.queue_manager.dequeue('translation', timeout=5)
if job:
await self.process_job(job)
except Exception as e:
logger.error(f"Error in worker loop: {e}")
await asyncio.sleep(1)
async def process_job(self, job: PipelineJob):
"""번역 작업 처리"""
try:
logger.info(f"Processing job {job.job_id} for translation")
summarized_items = job.data.get('summarized_items', [])
translated_items = []
for item_data in summarized_items:
summarized_item = SummarizedItem(**item_data)
                # Translate the title and the summary
                translated_title = await self._translate_text(
                    summarized_item.enriched_item.rss_item.title,
                    target_lang='EN'
                )
translated_summary = await self._translate_text(
summarized_item.ai_summary,
target_lang='EN'
)
translated_item = TranslatedItem(
summarized_item=summarized_item,
translated_title=translated_title,
translated_summary=translated_summary,
target_language='en'
)
translated_items.append(translated_item)
                # Throttle to respect API rate limits
await asyncio.sleep(0.5)
if translated_items:
logger.info(f"Translated {len(translated_items)} items")
                # Hand off to the next stage
job.data['translated_items'] = [item.dict() for item in translated_items]
job.stages_completed.append('translation')
job.stage = 'image_generation'
await self.queue_manager.enqueue('image_generation', job)
await self.queue_manager.mark_completed('translation', job.job_id)
else:
logger.warning(f"No items translated for job {job.job_id}")
await self.queue_manager.mark_failed(
'translation',
job,
"No items to translate"
)
except Exception as e:
logger.error(f"Error processing job {job.job_id}: {e}")
await self.queue_manager.mark_failed('translation', job, str(e))
async def _translate_text(self, text: str, target_lang: str = 'EN') -> str:
"""DeepL API를 사용한 텍스트 번역"""
try:
if not text:
return ""
async with httpx.AsyncClient() as client:
response = await client.post(
self.deepl_api_url,
data={
'auth_key': self.deepl_api_key,
'text': text,
'target_lang': target_lang,
'source_lang': 'KO'
},
timeout=30
)
if response.status_code == 200:
result = response.json()
return result['translations'][0]['text']
else:
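                    # NOTE: DeepL returns status 456 when the translation quota is exhausted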
logger.error(f"DeepL API error: {response.status_code}")
return text # 번역 실패시 원본 반환
except Exception as e:
logger.error(f"Error translating text: {e}")
            return text  # fall back to the original text on failure
async def stop(self):
"""워커 중지"""
await self.queue_manager.disconnect()
logger.info("Translator Worker stopped")
async def main():
"""메인 함수"""
worker = TranslatorWorker()
try:
await worker.start()
except KeyboardInterrupt:
logger.info("Received interrupt signal")
finally:
await worker.stop()
if __name__ == "__main__":
asyncio.run(main())


@ -1,204 +0,0 @@
# RSS Feed Subscription Service
A service for subscribing to and managing RSS/Atom feeds.
## Key Features
### 1. Feed Subscription Management
- Subscribe to RSS/Atom feed URLs
- Categorization (news, tech, business, etc.)
- Automatic update scheduling
- Feed status monitoring
### 2. Entry Management
- Automatic collection of new posts
- Read/unread state tracking
- Starring
- Full-content storage
### 3. Automatic Updates
- Configurable update interval (default 15 minutes)
- Background scheduler
- Error handling and retries
## API Endpoints
### Subscribe to a Feed
```
POST /api/feeds
{
  "url": "https://example.com/rss",
  "title": "Example Blog",
  "category": "tech",
  "update_interval": 900
}
```
### List Feeds
```
GET /api/feeds?category=tech&status=active
```
### List Entries
```
GET /api/entries?feed_id=xxx&is_read=false&limit=50
```
### Mark as Read
```
PUT /api/entries/{entry_id}/read?is_read=true
```
### Star an Entry
```
PUT /api/entries/{entry_id}/star?is_starred=true
```
### Get Statistics
```
GET /api/stats?feed_id=xxx
```
### Export OPML
```
GET /api/export/opml
```
## Usage Examples
### 1. Subscribe to a tech blog
```bash
curl -X POST http://localhost:8017/api/feeds \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://techcrunch.com/feed/",
    "category": "tech"
  }'
```
### 2. Subscribe to a Korean news RSS feed
```bash
curl -X POST http://localhost:8017/api/feeds \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.hani.co.kr/rss/",
    "category": "news",
    "update_interval": 600
  }'
```
### 3. Fetch unread entries
```bash
curl "http://localhost:8017/api/entries?is_read=false&limit=20"
```
### 4. Mark all entries as read
```bash
curl -X POST "http://localhost:8017/api/entries/mark-all-read?feed_id=xxx"
```
## Supported Categories
- `news`: News
- `tech`: Technology
- `business`: Business
- `science`: Science
- `health`: Health
- `sports`: Sports
- `entertainment`: Entertainment
- `lifestyle`: Lifestyle
- `politics`: Politics
- `other`: Other
## Configuration
### Required Settings
```env
MONGODB_URL=mongodb://mongodb:27017
DB_NAME=rss_feed_db
REDIS_URL=redis://redis:6379
REDIS_DB=3
```
### Optional Settings
```env
DEFAULT_UPDATE_INTERVAL=900  # default update interval (seconds)
MAX_ENTRIES_PER_FEED=100  # max entries per feed
ENABLE_SCHEDULER=true  # enable automatic updates
SCHEDULER_TIMEZONE=Asia/Seoul  # scheduler timezone
```
## Running with Docker
```bash
# Build and run
docker-compose build rss-feed-backend
docker-compose up -d rss-feed-backend
# Tail the logs
docker-compose logs -f rss-feed-backend
```
## Data Model
### FeedSubscription
- `title`: feed title
- `url`: RSS/Atom URL
- `description`: description
- `category`: category
- `status`: status (active/inactive/error)
- `update_interval`: update interval
- `last_fetch`: time of last update
- `error_count`: error count
### FeedEntry
- `feed_id`: feed ID
- `title`: post title
- `link`: link to the original article
- `summary`: summary
- `content`: full content
- `author`: author
- `published`: publication date
- `categories`: tags/categories
- `thumbnail`: thumbnail image
- `is_read`: read state
- `is_starred`: starred state
## Recommended RSS Feeds
### Korean News
- Hankyoreh: `https://www.hani.co.kr/rss/`
- Chosun Ilbo: `https://www.chosun.com/arc/outboundfeeds/rss/`
- JoongAng Ilbo: `https://rss.joins.com/joins_news_list.xml`
### Tech Blogs
- TechCrunch: `https://techcrunch.com/feed/`
- The Verge: `https://www.theverge.com/rss/index.xml`
- Ars Technica: `https://feeds.arstechnica.com/arstechnica/index`
### Developer Blogs
- GitHub Blog: `https://github.blog/feed/`
- Stack Overflow Blog: `https://stackoverflow.blog/feed/`
- Dev.to: `https://dev.to/feed`
## Health Check
```bash
curl http://localhost:8017/health
```
## Troubleshooting
### 1. Feed parsing fails
- Check that the RSS/Atom format is valid
- Check that the URL is reachable
- Check the feed encoding (UTF-8 recommended)
### 2. Feeds are not updating
- Confirm the scheduler is enabled (`ENABLE_SCHEDULER=true`)
- Check the MongoDB connection
- Confirm the feed status is `active`
### 3. Duplicate entries
- Check whether the feed provides unique entry IDs
- Review the entry ID generation logic


@ -1,21 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose port
EXPOSE 8000
# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]


@ -1,26 +0,0 @@
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# MongoDB Configuration
mongodb_url: str = "mongodb://mongodb:27017"
db_name: str = "rss_feed_db"
# Redis Configuration
redis_url: str = "redis://redis:6379"
redis_db: int = 3
# Feed Settings
default_update_interval: int = 900 # 15 minutes in seconds
max_entries_per_feed: int = 100
fetch_timeout: int = 30
# Scheduler Settings
enable_scheduler: bool = True
scheduler_timezone: str = "Asia/Seoul"
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
settings = Settings()


@ -1,222 +0,0 @@
import feedparser
import httpx
from typing import List, Dict, Any, Optional
from datetime import datetime
from dateutil import parser as date_parser
from bs4 import BeautifulSoup
import re
import hashlib
from .models import FeedEntry
class FeedParser:
def __init__(self):
self.client = httpx.AsyncClient(
timeout=30.0,
follow_redirects=True,
headers={
"User-Agent": "Mozilla/5.0 (compatible; RSS Feed Reader/1.0)"
}
)
async def parse_feed(self, url: str) -> Dict[str, Any]:
"""Parse RSS/Atom feed from URL"""
try:
response = await self.client.get(url)
response.raise_for_status()
# Parse the feed
feed = feedparser.parse(response.content)
if feed.bozo and feed.bozo_exception:
raise Exception(f"Feed parsing error: {feed.bozo_exception}")
return {
"success": True,
"feed": feed.feed,
"entries": feed.entries,
"error": None
}
except Exception as e:
return {
"success": False,
"feed": None,
"entries": [],
"error": str(e)
}
def extract_entry_data(self, entry: Any, feed_id: str) -> FeedEntry:
"""Extract and normalize entry data"""
# Generate unique entry ID
entry_id = self._generate_entry_id(entry)
# Extract title
title = entry.get("title", "Untitled")
# Extract link
link = entry.get("link", "")
# Extract summary/description
summary = self._extract_summary(entry)
# Extract content
content = self._extract_content(entry)
# Extract author
author = entry.get("author", "")
# Extract published date
published = self._parse_date(entry.get("published", entry.get("updated")))
# Extract updated date
updated = self._parse_date(entry.get("updated", entry.get("published")))
# Extract categories
categories = self._extract_categories(entry)
# Extract thumbnail
thumbnail = self._extract_thumbnail(entry)
# Extract enclosures (media attachments)
enclosures = self._extract_enclosures(entry)
return FeedEntry(
feed_id=feed_id,
entry_id=entry_id,
title=title,
link=link,
summary=summary,
content=content,
author=author,
published=published,
updated=updated,
categories=categories,
thumbnail=thumbnail,
enclosures=enclosures
)
def _generate_entry_id(self, entry: Any) -> str:
"""Generate unique ID for entry"""
# Try to use entry's unique ID first
if hasattr(entry, "id"):
return entry.id
# Generate from link and title
unique_str = f"{entry.get('link', '')}{entry.get('title', '')}"
return hashlib.md5(unique_str.encode()).hexdigest()
def _extract_summary(self, entry: Any) -> Optional[str]:
"""Extract and clean summary"""
summary = entry.get("summary", entry.get("description", ""))
if summary:
# Clean HTML tags
soup = BeautifulSoup(summary, "html.parser")
text = soup.get_text(separator=" ", strip=True)
# Limit length
if len(text) > 500:
text = text[:497] + "..."
return text
return None
def _extract_content(self, entry: Any) -> Optional[str]:
"""Extract full content"""
content = ""
# Try content field
if hasattr(entry, "content"):
for c in entry.content:
if c.get("type") in ["text/html", "text/plain"]:
content = c.get("value", "")
break
# Fallback to summary detail
if not content and hasattr(entry, "summary_detail"):
content = entry.summary_detail.get("value", "")
# Clean excessive whitespace
if content:
content = re.sub(r'\s+', ' ', content).strip()
return content
return None
def _parse_date(self, date_str: Optional[str]) -> Optional[datetime]:
"""Parse date string to datetime"""
if not date_str:
return None
        try:
            # Try parsing with dateutil
            return date_parser.parse(date_str)
        except (ValueError, TypeError, OverflowError):
            # Fall back to feedparser's struct_time representation
            try:
                if hasattr(date_str, "tm_year"):
                    import time
                    return datetime.fromtimestamp(time.mktime(date_str))
            except Exception:
                pass
        return None
def _extract_categories(self, entry: Any) -> List[str]:
"""Extract categories/tags"""
categories = []
if hasattr(entry, "tags"):
for tag in entry.tags:
if hasattr(tag, "term"):
categories.append(tag.term)
elif isinstance(tag, str):
categories.append(tag)
return categories
def _extract_thumbnail(self, entry: Any) -> Optional[str]:
"""Extract thumbnail image URL"""
# Check media thumbnail
if hasattr(entry, "media_thumbnail"):
for thumb in entry.media_thumbnail:
if thumb.get("url"):
return thumb["url"]
# Check media content
if hasattr(entry, "media_content"):
for media in entry.media_content:
if media.get("type", "").startswith("image/"):
return media.get("url")
# Check enclosures
if hasattr(entry, "enclosures"):
for enc in entry.enclosures:
if enc.get("type", "").startswith("image/"):
return enc.get("href", enc.get("url"))
        # Extract from content/summary (the old one-liner dropped the summary
        # whenever the entry had no content attribute, due to ternary precedence)
        content = entry.get("summary", "")
        if hasattr(entry, "content") and entry.content:
            content += entry.content[0].get("value", "")
if content:
soup = BeautifulSoup(content, "html.parser")
img = soup.find("img")
if img and img.get("src"):
return img["src"]
return None
def _extract_enclosures(self, entry: Any) -> List[Dict[str, Any]]:
"""Extract media enclosures"""
enclosures = []
if hasattr(entry, "enclosures"):
for enc in entry.enclosures:
enclosure = {
"url": enc.get("href", enc.get("url", "")),
"type": enc.get("type", ""),
"length": enc.get("length", 0)
}
if enclosure["url"]:
enclosures.append(enclosure)
return enclosures
async def close(self):
"""Close HTTP client"""
await self.client.aclose()


@ -1,442 +0,0 @@
from fastapi import FastAPI, HTTPException, Query, Path, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from typing import List, Optional
from datetime import datetime
from contextlib import asynccontextmanager
import motor.motor_asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.interval import IntervalTrigger
import pytz
import redis.asyncio as redis
import json
from .config import settings
from .models import (
FeedSubscription, FeedEntry, CreateFeedRequest,
UpdateFeedRequest, FeedStatistics, FeedStatus
)
from .feed_parser import FeedParser
# Database connection
db_client = None
db = None
redis_client = None
scheduler = None
parser = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global db_client, db, redis_client, scheduler, parser
# Connect to MongoDB
db_client = motor.motor_asyncio.AsyncIOMotorClient(settings.mongodb_url)
db = db_client[settings.db_name]
# Connect to Redis
redis_client = redis.from_url(settings.redis_url, db=settings.redis_db)
# Initialize feed parser
parser = FeedParser()
# Initialize scheduler
if settings.enable_scheduler:
scheduler = AsyncIOScheduler(timezone=pytz.timezone(settings.scheduler_timezone))
scheduler.add_job(
update_all_feeds,
trigger=IntervalTrigger(seconds=60),
id="update_feeds",
replace_existing=True
)
scheduler.start()
print("RSS Feed scheduler started")
print("RSS Feed Service starting...")
yield
# Cleanup
if scheduler:
scheduler.shutdown()
if parser:
await parser.close()
if redis_client:
await redis_client.close()
db_client.close()
print("RSS Feed Service stopping...")
app = FastAPI(
    title="RSS Feed Service",
    description="RSS/Atom feed subscription and management service",
    version="1.0.0",
    lifespan=lifespan
)
# CORS configuration
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Helper functions
async def update_feed(feed_id: str):
"""Update a single feed"""
feed = await db.feeds.find_one({"_id": feed_id})
if not feed:
return
# Parse feed
result = await parser.parse_feed(feed["url"])
if result["success"]:
# Update feed metadata
await db.feeds.update_one(
{"_id": feed_id},
{
"$set": {
"last_fetch": datetime.now(),
"status": FeedStatus.ACTIVE,
"error_count": 0,
"last_error": None,
"updated_at": datetime.now()
}
}
)
# Process entries
for entry_data in result["entries"][:settings.max_entries_per_feed]:
entry = parser.extract_entry_data(entry_data, feed_id)
# Check if entry already exists
existing = await db.entries.find_one({
"feed_id": feed_id,
"entry_id": entry.entry_id
})
if not existing:
# Insert new entry
await db.entries.insert_one(entry.dict())
else:
# Update existing entry if newer
if entry.updated and existing.get("updated"):
if entry.updated > existing["updated"]:
await db.entries.update_one(
{"_id": existing["_id"]},
{"$set": entry.dict(exclude={"id", "created_at"})}
)
else:
# Update error status
await db.feeds.update_one(
{"_id": feed_id},
{
"$set": {
"status": FeedStatus.ERROR,
"last_error": result["error"],
"updated_at": datetime.now()
},
"$inc": {"error_count": 1}
}
)
async def update_all_feeds():
"""Update all active feeds that need updating"""
now = datetime.now()
# Find feeds that need updating
feeds = await db.feeds.find({
"status": FeedStatus.ACTIVE,
"$or": [
{"last_fetch": None},
{"last_fetch": {"$lt": now}}
]
}).to_list(100)
for feed in feeds:
# Check if it's time to update
if feed.get("last_fetch"):
time_diff = (now - feed["last_fetch"]).total_seconds()
if time_diff < feed.get("update_interval", settings.default_update_interval):
continue
# Update feed in background
await update_feed(str(feed["_id"]))
# API Endpoints
@app.get("/")
async def root():
return {
"service": "RSS Feed Service",
"version": "1.0.0",
"timestamp": datetime.now().isoformat(),
"endpoints": {
"subscribe": "POST /api/feeds",
"list_feeds": "GET /api/feeds",
"get_entries": "GET /api/entries",
"mark_read": "PUT /api/entries/{entry_id}/read",
"mark_starred": "PUT /api/entries/{entry_id}/star",
"statistics": "GET /api/stats"
}
}
@app.get("/health")
async def health_check():
return {
"status": "healthy",
"service": "rss-feed",
"timestamp": datetime.now().isoformat()
}
@app.post("/api/feeds", response_model=FeedSubscription)
async def subscribe_to_feed(request: CreateFeedRequest, background_tasks: BackgroundTasks):
"""RSS/Atom 피드 구독"""
# Check if already subscribed
existing = await db.feeds.find_one({"url": str(request.url)})
if existing:
raise HTTPException(status_code=400, detail="이미 구독 중인 피드입니다")
# Parse feed to get metadata
result = await parser.parse_feed(str(request.url))
if not result["success"]:
raise HTTPException(status_code=400, detail=f"피드 파싱 실패: {result['error']}")
# Create subscription
feed = FeedSubscription(
title=request.title or result["feed"].get("title", "Untitled Feed"),
url=request.url,
description=result["feed"].get("description", ""),
category=request.category,
update_interval=request.update_interval or settings.default_update_interval
)
# Save to database - convert URL to string
feed_dict = feed.dict()
feed_dict["url"] = str(feed_dict["url"])
result = await db.feeds.insert_one(feed_dict)
feed.id = str(result.inserted_id)
# Fetch entries in background
background_tasks.add_task(update_feed, feed.id)
return feed
@app.get("/api/feeds", response_model=List[FeedSubscription])
async def list_feeds(
category: Optional[str] = Query(None, description="카테고리 필터"),
status: Optional[FeedStatus] = Query(None, description="상태 필터")
):
"""구독 중인 피드 목록 조회"""
query = {}
if category:
query["category"] = category
if status:
query["status"] = status
feeds = await db.feeds.find(query).to_list(100)
for feed in feeds:
feed["_id"] = str(feed["_id"])
return feeds
@app.get("/api/feeds/{feed_id}", response_model=FeedSubscription)
async def get_feed(feed_id: str = Path(..., description="피드 ID")):
"""특정 피드 정보 조회"""
feed = await db.feeds.find_one({"_id": feed_id})
if not feed:
raise HTTPException(status_code=404, detail="피드를 찾을 수 없습니다")
feed["_id"] = str(feed["_id"])
return feed
@app.put("/api/feeds/{feed_id}", response_model=FeedSubscription)
async def update_feed_subscription(
feed_id: str = Path(..., description="피드 ID"),
request: UpdateFeedRequest = ...
):
"""피드 구독 정보 수정"""
update_data = request.dict(exclude_unset=True)
if update_data:
update_data["updated_at"] = datetime.now()
result = await db.feeds.update_one(
{"_id": feed_id},
{"$set": update_data}
)
if result.matched_count == 0:
raise HTTPException(status_code=404, detail="피드를 찾을 수 없습니다")
feed = await db.feeds.find_one({"_id": feed_id})
feed["_id"] = str(feed["_id"])
return feed
@app.delete("/api/feeds/{feed_id}")
async def unsubscribe_from_feed(feed_id: str = Path(..., description="피드 ID")):
"""피드 구독 취소"""
# Delete feed
result = await db.feeds.delete_one({"_id": feed_id})
if result.deleted_count == 0:
raise HTTPException(status_code=404, detail="피드를 찾을 수 없습니다")
# Delete associated entries
await db.entries.delete_many({"feed_id": feed_id})
return {"message": "구독이 취소되었습니다"}
@app.post("/api/feeds/{feed_id}/refresh")
async def refresh_feed(
feed_id: str = Path(..., description="피드 ID"),
background_tasks: BackgroundTasks = ...
):
"""피드 수동 새로고침"""
feed = await db.feeds.find_one({"_id": feed_id})
if not feed:
raise HTTPException(status_code=404, detail="피드를 찾을 수 없습니다")
background_tasks.add_task(update_feed, feed_id)
return {"message": "피드 새로고침이 시작되었습니다"}
@app.get("/api/entries", response_model=List[FeedEntry])
async def get_entries(
feed_id: Optional[str] = Query(None, description="피드 ID"),
is_read: Optional[bool] = Query(None, description="읽음 상태 필터"),
is_starred: Optional[bool] = Query(None, description="별표 상태 필터"),
limit: int = Query(50, ge=1, le=100, description="결과 개수"),
offset: int = Query(0, ge=0, description="오프셋")
):
"""피드 엔트리 목록 조회"""
query = {}
if feed_id:
query["feed_id"] = feed_id
if is_read is not None:
query["is_read"] = is_read
if is_starred is not None:
query["is_starred"] = is_starred
entries = await db.entries.find(query) \
.sort("published", -1) \
.skip(offset) \
.limit(limit) \
.to_list(limit)
for entry in entries:
entry["_id"] = str(entry["_id"])
return entries
@app.get("/api/entries/{entry_id}", response_model=FeedEntry)
async def get_entry(entry_id: str = Path(..., description="엔트리 ID")):
"""특정 엔트리 조회"""
entry = await db.entries.find_one({"_id": entry_id})
if not entry:
raise HTTPException(status_code=404, detail="엔트리를 찾을 수 없습니다")
entry["_id"] = str(entry["_id"])
return entry
@app.put("/api/entries/{entry_id}/read")
async def mark_entry_as_read(
entry_id: str = Path(..., description="엔트리 ID"),
is_read: bool = Query(True, description="읽음 상태")
):
"""엔트리 읽음 상태 변경"""
result = await db.entries.update_one(
{"_id": entry_id},
{"$set": {"is_read": is_read}}
)
if result.matched_count == 0:
raise HTTPException(status_code=404, detail="엔트리를 찾을 수 없습니다")
return {"message": f"읽음 상태가 {is_read}로 변경되었습니다"}
@app.put("/api/entries/{entry_id}/star")
async def mark_entry_as_starred(
entry_id: str = Path(..., description="엔트리 ID"),
is_starred: bool = Query(True, description="별표 상태")
):
"""엔트리 별표 상태 변경"""
result = await db.entries.update_one(
{"_id": entry_id},
{"$set": {"is_starred": is_starred}}
)
if result.matched_count == 0:
raise HTTPException(status_code=404, detail="엔트리를 찾을 수 없습니다")
return {"message": f"별표 상태가 {is_starred}로 변경되었습니다"}
@app.post("/api/entries/mark-all-read")
async def mark_all_as_read(feed_id: Optional[str] = Query(None, description="피드 ID")):
"""모든 엔트리를 읽음으로 표시"""
query = {}
if feed_id:
query["feed_id"] = feed_id
result = await db.entries.update_many(
query,
{"$set": {"is_read": True}}
)
return {"message": f"{result.modified_count}개 엔트리가 읽음으로 표시되었습니다"}
@app.get("/api/stats", response_model=List[FeedStatistics])
async def get_statistics(feed_id: Optional[str] = Query(None, description="피드 ID")):
"""피드 통계 조회"""
if feed_id:
feeds = [await db.feeds.find_one({"_id": feed_id})]
if not feeds[0]:
raise HTTPException(status_code=404, detail="피드를 찾을 수 없습니다")
else:
feeds = await db.feeds.find().to_list(100)
stats = []
for feed in feeds:
feed_id = str(feed["_id"])
# Count entries
total = await db.entries.count_documents({"feed_id": feed_id})
unread = await db.entries.count_documents({"feed_id": feed_id, "is_read": False})
starred = await db.entries.count_documents({"feed_id": feed_id, "is_starred": True})
# Calculate error rate
error_rate = 0
if feed.get("error_count", 0) > 0:
total_fetches = feed.get("error_count", 0) + (1 if feed.get("last_fetch") else 0)
error_rate = feed.get("error_count", 0) / total_fetches
stats.append(FeedStatistics(
feed_id=feed_id,
total_entries=total,
unread_entries=unread,
starred_entries=starred,
last_update=feed.get("last_fetch"),
error_rate=error_rate
))
return stats
@app.get("/api/export/opml")
async def export_opml():
"""피드 목록을 OPML 형식으로 내보내기"""
feeds = await db.feeds.find().to_list(100)
opml = """<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
<head>
<title>RSS Feed Subscriptions</title>
<dateCreated>{}</dateCreated>
</head>
<body>""".format(datetime.now().isoformat())
for feed in feeds:
opml += f'\n <outline text="{feed["title"]}" xmlUrl="{feed["url"]}" type="rss" category="{feed.get("category", "")}" />'
opml += "\n</body>\n</opml>"
return {
"opml": opml,
"feed_count": len(feeds)
}


@ -1,74 +0,0 @@
from pydantic import BaseModel, Field, HttpUrl
from typing import Optional, List, Dict, Any
from datetime import datetime
from enum import Enum
class FeedStatus(str, Enum):
ACTIVE = "active"
INACTIVE = "inactive"
ERROR = "error"
class FeedCategory(str, Enum):
NEWS = "news"
TECH = "tech"
BUSINESS = "business"
SCIENCE = "science"
HEALTH = "health"
SPORTS = "sports"
ENTERTAINMENT = "entertainment"
LIFESTYLE = "lifestyle"
POLITICS = "politics"
OTHER = "other"
class FeedSubscription(BaseModel):
id: Optional[str] = Field(None, alias="_id")
title: str
url: HttpUrl
description: Optional[str] = None
category: FeedCategory = FeedCategory.OTHER
status: FeedStatus = FeedStatus.ACTIVE
update_interval: int = 900 # seconds
last_fetch: Optional[datetime] = None
last_error: Optional[str] = None
error_count: int = 0
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
metadata: Dict[str, Any] = {}
class FeedEntry(BaseModel):
id: Optional[str] = Field(None, alias="_id")
feed_id: str
entry_id: str # RSS entry unique ID
title: str
link: str
summary: Optional[str] = None
content: Optional[str] = None
author: Optional[str] = None
published: Optional[datetime] = None
updated: Optional[datetime] = None
categories: List[str] = []
thumbnail: Optional[str] = None
enclosures: List[Dict[str, Any]] = []
is_read: bool = False
is_starred: bool = False
created_at: datetime = Field(default_factory=datetime.now)
class CreateFeedRequest(BaseModel):
url: HttpUrl
title: Optional[str] = None
category: FeedCategory = FeedCategory.OTHER
update_interval: Optional[int] = 900
class UpdateFeedRequest(BaseModel):
title: Optional[str] = None
category: Optional[FeedCategory] = None
update_interval: Optional[int] = None
status: Optional[FeedStatus] = None
class FeedStatistics(BaseModel):
feed_id: str
total_entries: int
unread_entries: int
starred_entries: int
last_update: Optional[datetime]
error_rate: float


@ -1,14 +0,0 @@
fastapi==0.109.0
uvicorn[standard]==0.27.0
pydantic==2.5.3
pydantic-settings==2.1.0
feedparser==6.0.11
httpx==0.26.0
pymongo==4.6.1
motor==3.3.2
redis==5.0.1
python-dateutil==2.8.2
beautifulsoup4==4.12.3
lxml==5.1.0
apscheduler==3.10.4
pytz==2024.1