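Monitoring

Health and disk alerting for the dev backend (Lightsail) and the prod backend (legacy EC2, with a new Lightsail instance planned). The primary channel is a Slack incoming webhook: two cron jobs installed directly on the Lightsail box (disk-alarm and health-monitor) send an alert only when the state changes. The frontend is observed through Sentry, and LLM calls through Langfuse traces.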

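Alert channel overview

Channel	Target	Trigger	Message volume
Slack webhook (Lightsail)	Backend host	Disk ≥80% / health UP↔DOWN transition	Transitions only (to limit noise)
Sentry	Frontend	JS runtime errors, unhandled promise rejections	Every occurrence (no sampling policy; needs verification)
Langfuse trace	7 LLM call sites	Every inference is traced automatically	Every call
AWS CloudWatch	RDS / ALB / EC2	Metrics only; no alerting wired up	Manual lookup
`actuator/health`	Backend, from outside	No external URL pinger installed; manual checks / Cloudflare-side alerts	Manual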

Slack alerting system (Lightsail)

Two cron jobs, installed on 2026-05-04 on the dev backend host (semu-gpt-dev.bootalk.co.kr), send alerts through a Slack incoming webhook.

Install locations

File	Role	Schedule / notes
`/usr/local/bin/disk-alarm.sh`	Detects root filesystem usage ≥80%	`/etc/cron.hourly/disk-alarm`, hourly
`/usr/local/bin/health-monitor.sh`	Calls `localhost:8080/actuator/health` and decides UP/DOWN	`/etc/cron.d/semugpt-health`, every 5 minutes
`/etc/semugpt/slack-webhook.url`	Slack incoming webhook URL (mode 0600, root-only)	Not committed to git
`/etc/semugpt/slack-mention.txt`	Mention prefix for alerts (currently `<!here> @주우철`)	Edited manually
`/var/lib/semugpt/last-health-state`	Previous health state (UP/DOWN), used to detect transitions	Updated by health-monitor
`/var/log/DISK_ALARM`	Flag file marking an active disk alarm; tracks the previous state	Created/removed by disk-alarm
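How the threshold check and the flag file could fit together is sketched below. This is a minimal sketch, not the installed script: the function names `disk_alarm_step` and `root_usage_percent` and the message strings are hypothetical, and the real script presumably posts to Slack instead of printing. Only the 80% threshold and the flag-file idea come from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a transition-only disk alarm; all names are illustrative.

# disk_alarm_step USAGE_PERCENT THRESHOLD FLAG_FILE
# Prints ALERT when usage first reaches the threshold, RECOVERED when it
# first drops back below it, and nothing while the state is unchanged.
disk_alarm_step() {
  local usage="$1" threshold="$2" flag="$3"
  if [ "$usage" -ge "$threshold" ]; then
    if [ ! -e "$flag" ]; then
      : > "$flag"              # remember that the alarm is active
      echo "ALERT disk ${usage}% >= ${threshold}%"
    fi
  elif [ -e "$flag" ]; then
    rm -f "$flag"              # alarm cleared
    echo "RECOVERED disk ${usage}% < ${threshold}%"
  fi
}

# Root filesystem usage as a bare integer (POSIX df output, column 5).
root_usage_percent() {
  df -P / | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}
```

A cron job built on this sketch would call something like `disk_alarm_step "$(root_usage_percent)" 80 /var/log/DISK_ALARM` and forward any printed line to Slack.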

Alert policy

Alerts are sent only on state transitions; alerting on every cron run would just be noise.
Only critical alerts include the mention: disk full, backend DOWN.
Recovery alerts carry no mention, so they do not wake anyone.
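The three rules above can be sketched as a small decision function. This is an illustration under assumptions, not the contents of `health-monitor.sh`: the function names and message wording are hypothetical, and `probe_health` assumes the endpoint returns a non-2xx status when the backend is unhealthy.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the transition-only policy; not the installed script.

# health_alert PREV CURR MENTION
# Prints the Slack message text for a state change, or nothing if unchanged.
health_alert() {
  local prev="$1" curr="$2" mention="$3"
  [ "$prev" = "$curr" ] && return 0           # no transition: stay silent
  case "$curr" in
    DOWN) echo "${mention} backend DOWN" ;;   # critical: include the mention
    UP)   echo "backend recovered (UP)" ;;    # recovery: no mention
  esac
}

# probe_health: print UP if the local health endpoint answers successfully,
# DOWN otherwise (connection error, timeout, or HTTP error status).
probe_health() {
  if curl -fsS --max-time 5 -o /dev/null http://localhost:8080/actuator/health; then
    echo UP
  else
    echo DOWN
  fi
}
```

A caller would read the previous state from `/var/lib/semugpt/last-health-state`, compare it with `probe_health`, send whatever `health_alert` prints, and then write the new state back. On a first run with no state file, seeding the previous state as UP avoids a spurious recovery message.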

Checking status

# Alert history for the last 24 hours (most recent 20 lines)
ssh semugpt-aws 'sudo journalctl -t disk-alarm -t health-monitor --since "24 hours ago" --no-pager | tail -20'

# Current disk flag / previous health state
ssh semugpt-aws 'cat /var/log/DISK_ALARM 2>/dev/null; cat /var/lib/semugpt/last-health-state'

Changing the webhook / mention

# Replace the webhook URL (>/dev/null keeps the secret out of the terminal output)
ssh semugpt-aws 'echo "https://hooks.slack.com/services/NEW/URL" | sudo tee /etc/semugpt/slack-webhook.url >/dev/null'
ssh semugpt-aws 'echo "<@U01ABCDEF>" | sudo tee /etc/semugpt/slack-mention.txt'

# Change the threshold (default 80%): edit the THRESHOLD=80 line in the script
ssh semugpt-aws 'sudo vi /usr/local/bin/disk-alarm.sh'
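After changing the webhook URL or the mention, a test post is a quick way to confirm the new values work. The sketch below is an assumption about how one might do that, not part of the installed scripts: the function names are hypothetical, and `json_escape` only handles single-line text. The JSON body with a `text` field is the standard Slack incoming-webhook payload.

```shell
#!/usr/bin/env bash
# Sketch: build a Slack incoming-webhook payload and send a test message.

# json_escape TEXT: escape backslashes and double quotes for a JSON string.
# Assumes single-line text without control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# slack_payload TEXT: print the JSON body expected by incoming webhooks.
slack_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

# send_test_message: post a test message using the files listed in the
# install table. Run on the host as root, since the URL file is root-only.
send_test_message() {
  local url mention
  url="$(cat /etc/semugpt/slack-webhook.url)"
  mention="$(cat /etc/semugpt/slack-mention.txt)"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "${mention} webhook test")" "$url"
}
```

Slack replies with `ok` when the post is accepted; with `-f`, curl exits non-zero if the webhook URL is rejected.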

The same pattern will be applied to the prod backend (Lightsail semugpt-prod, ⚠️ not yet created). See the "Production Infrastructure" section of CLAUDE.md.

External health checks

Reachability from outside, which the alerting system cannot detect, is checked manually.

# Dev backend (Cloudflare Tunnel → Lightsail)
curl -s -o /dev/null -w "BACKEND  %{http_code} (%{time_total}s)\n" \
  --max-time 10 https://semu-gpt-dev.bootalk.co.kr/actuator/health

# Dev frontend (Cloudflare Workers)
curl -s -o /dev/null -w "FRONTEND %{http_code} (%{time_total}s)\n" \
  --max-time 10 https://semu-chat-dev.bootalk.co.kr/

# Prod (legacy backend)
curl -s https://api.semugpt.co.kr/actuator/health

Healthy means HTTP 200 for the first two commands and a response body reporting UP for the last.
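The three manual checks can also be run in one pass. The following is a minimal sketch with hypothetical function names; the URLs are the ones from the commands above, and it treats anything other than HTTP 200 as a failure.

```shell
#!/usr/bin/env bash
# Sketch: check several endpoints in one pass; function names are illustrative.

# http_code URL: print the HTTP status code (curl prints 000 if the request fails).
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# verdict CODE: map a status code to OK or FAIL.
verdict() {
  if [ "$1" = "200" ]; then echo OK; else echo FAIL; fi
}

# check_all: print one line per endpoint; return non-zero if any check failed.
check_all() {
  local url code rc=0
  for url in \
    https://semu-gpt-dev.bootalk.co.kr/actuator/health \
    https://semu-chat-dev.bootalk.co.kr/ \
    https://api.semugpt.co.kr/actuator/health
  do
    code="$(http_code "$url")"
    echo "$(verdict "$code") $code $url"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}
```

Running `check_all` from a laptop or a scheduled job outside the Lightsail box gives a single exit status that a wrapper could act on.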

Sentry (frontend)

Item	Value
Package	`@sentry/nextjs` v10
Org	`bootalk`
Project	`semugpt`
Config files	`apps/frontend/sentry.server.config.ts`, `apps/frontend/sentry.edge.config.ts`, `apps/frontend/sentry.client.config.ts` (#162 4.5d)
Wrapping	`withSentryConfig(nextConfig, ...)` in `apps/frontend/next.config.js`

⚠️ Sentry is not installed on the backend. Backend runtime exceptions can only be traced in the systemd journal (journalctl -u semugpt-backend). For details, see the "인프라 한계" (infrastructure limits) section of Known Issues.

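Langfuse (LLM trace)

Every LLM call site leaves a trace in Langfuse. The page counts 7 call sites, although 8 names are listed (count needs verification): rag-final-answer, rag-final-answer-multiturn, rag-document-filter, hyde-generator, intent-router, query-rewrite, tax-category-inference, keyword-extraction.

Purpose	Tool
Managing prompt versions and config (`model`, `temperature`, `maxTokens`)	Langfuse UI
Inspecting real traces (question/response/tokens/latency)	Langfuse UI, or MCP tools such as `mcp__langfuse__listPrompts`
Slack / pager integration	Not integrated (manual checks)

For configuration and model replacement procedures, see the "Langfuse 프롬프트 모델 설정" (Langfuse prompt model settings) section of CLAUDE.md in the project root.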

AWS CloudWatch (manual lookup)

No alerting is wired up. Query metrics manually when needed.

# RDS FreeStorageSpace (last hour)
# date -v is BSD/macOS syntax; with GNU date use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
aws --profile ob cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=tax-gpt \
  --start-time "$(date -u -v -1H +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average

# Legacy EC2 state
aws --profile ob ec2 describe-instances --instance-ids i-07aea223b0818ab0a \
  --query 'Reservations[0].Instances[0].{State:State.Name,IP:PublicIpAddress}' --output table

# ALB target health (legacy TG)
aws --profile ob elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-northeast-2:023888247019:targetgroup/semu-gpt-instance/c290ed2d2bcca56a
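The RDS query above computes its start time with `date -v`, which is BSD/macOS syntax and fails with GNU date on Linux. A small helper can cover both; this is a sketch with a hypothetical function name, trying the BSD form first and falling back to the GNU form.

```shell
#!/usr/bin/env bash
# Sketch: "one hour ago" in UTC on both BSD (macOS) and GNU (Linux) date.

one_hour_ago_utc() {
  local fmt='+%Y-%m-%dT%H:%M:%SZ'
  # BSD date understands -v; GNU date rejects it, so fall back to -d.
  date -u -v-1H "$fmt" 2>/dev/null || date -u -d '1 hour ago' "$fmt"
}
```

With the helper defined, the query can pass `--start-time "$(one_hour_ago_utc)"` and run unchanged on either platform.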

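Monitoring gaps (room for improvement)

No Sentry on the backend: runtime exceptions are visible only in journalctl, and there is no alert channel for them.
No external reachability pinger: neither Cloudflare Workers nor an external SaaS (UptimeRobot, etc.) is wired up, so user reports are the first-line channel.
No CloudWatch alarms: nothing fires automatically for RDS FreeStorageSpace, ALB 5xx, EC2 CPU, and similar metrics.
disk-alarm/health-monitor not yet deployed on prod Lightsail: the dev pattern will be replicated when the semugpt-prod box is created (Issue #151 Phase 5).

For further gaps and proposed fixes, see Known Issues.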