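Recovery Procedure¶

Diagnosis and recovery flow to follow when semu-gpt-dev.bootalk.co.kr or api.semugpt.co.kr returns 5xx. The most common cause of failure on the Lightsail backend is a full disk: systemd reports the service as active, but actuator/health returns DOWN. All recovery commands run through ssh semugpt-aws (an SSH alias) or aws --profile ob (requires an SSO token).

Diagnostic decision tree¶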

flowchart TD
    A[User/monitoring report:<br/>5xx or health DOWN] --> B{actuator/health<br/>called externally}
    B -->|200 UP| C[Healthy - suspect client-side<br/>network/cache]
    B -->|503/timeout| D[Inspect the host over SSH]

    D --> E{sudo systemctl status<br/>semugpt-backend}
    E -->|inactive failed| F[Restart backend]
    E -->|active running| G{df -h /<br/>disk usage}

    G -->|Use% < 90| H[docker compose ps<br/>check cloudflared status]
    G -->|Use% ≥ 90| I[Disk cleanup runbook]

    H --> J{cloudflared active?}
    J -->|no| K[Restart cloudflared]
    J -->|yes| L{infra containers Up?}
    L -->|no| M[docker compose restart]
    L -->|yes| N[Check stack trace with journalctl]

    F --> Z[Recheck health]
    I --> Z
    K --> Z
    M --> Z
    N --> Z
    Z -->|UP| END[Recovery complete]
    Z -->|still DOWN| LAST[Last resort:<br/>Lightsail reboot]
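
The branches of the tree can be written down as a small pure function, which makes the order of checks explicit. This is a minimal sketch, not part of the deployed tooling: the function name `next_step`, its argument order, and the returned labels are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: maps observed state to the next runbook step.
# Arguments (assumed names, in this order):
#   $1 external health HTTP code (200, 503, or 000 for a timeout)
#   $2 systemd state of semugpt-backend (active | inactive | failed)
#   $3 root disk Use% as an integer
#   $4 systemd state of cloudflared (active | inactive)
#   $5 infra containers state (up | down)
next_step() {
  health_code=$1
  backend=$2
  disk_pct=$3
  tunnel=$4
  infra=$5

  if [ "$health_code" = "200" ]; then
    echo "healthy: suspect client-side network or cache"
  elif [ "$backend" != "active" ]; then
    echo "restart backend"
  elif [ "$disk_pct" -ge 90 ]; then
    echo "disk cleanup"
  elif [ "$tunnel" != "active" ]; then
    echo "restart cloudflared"
  elif [ "$infra" != "up" ]; then
    echo "restart infra containers"
  else
    echo "check journalctl stack trace"
  fi
}

# Example: backend is active but the disk is at 97%.
# next_step 503 active 97 active up
```

Every branch ends with a health recheck, and a Lightsail reboot remains the last resort if health is still DOWN.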

One-line diagnosis from outside¶

# Check the dev backend and frontend together
curl -s -o /dev/null -w "BACKEND  %{http_code} (%{time_total}s)\n" --max-time 10 https://semu-gpt-dev.bootalk.co.kr/actuator/health
curl -s -o /dev/null -w "FRONTEND %{http_code} (%{time_total}s)\n" --max-time 10 https://semu-chat-dev.bootalk.co.kr/

# Prod backend (legacy)
curl -s https://api.semugpt.co.kr/actuator/health

200/UP means healthy. On 5xx or DOWN, continue with the steps below.
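
The dev check above prints only the status code, while the prod check prints only the body. The sketch below combines both into one verdict. It assumes the Spring Boot Actuator default of a compact `{"status":"UP"}` body; the function name `classify_health` and its output labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn an HTTP code plus response body into one verdict.
# $1 = HTTP status code as printed by curl's %{http_code} (000 means no response)
# $2 = response body
classify_health() {
  code=$1
  body=$2
  case "$code" in
    200)
      case "$body" in
        *'"status":"UP"'*) echo "UP" ;;
        *) echo "DEGRADED" ;;   # 200 without an UP status in the body
      esac
      ;;
    5??) echo "DOWN" ;;
    000) echo "TIMEOUT" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Usage (writes the body to a temp file so both values are available):
# code=$(curl -s -o /tmp/health.json -w '%{http_code}' --max-time 10 \
#   https://semu-gpt-dev.bootalk.co.kr/actuator/health)
# classify_health "$code" "$(cat /tmp/health.json)"
```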

SSH diagnosis (Lightsail host)¶

# Check disk and system state in one pass
ssh semugpt-aws '
  echo "=== disk ==="; df -h /
  echo "=== systemd ==="; sudo systemctl status semugpt-backend cloudflared --no-pager | head -30
  echo "=== docker ==="; sudo docker compose -f /opt/semugpt-2026/docker-compose.dev.yml ps
  echo "=== local health ==="; curl -s http://localhost:8080/actuator/health
'

Symptom	Suspected cause	Next step
`df -h /` Use% 100%	Disk full (most common)	"Disk cleanup" section
systemd `inactive (dead)` or `failed`	Backend crash or OOM	"Restart the backend" + check `journalctl`
systemd `active` but localhost:8080 times out	JVM hang (heap/GC)	"Restart the backend"
Docker container unhealthy	infra (mysql/es/redis) problem	"Restart infra containers"
cloudflared `inactive`	Cloudflare Tunnel disconnected → 502	"Restart Cloudflare Tunnel"
Everything looks healthy but still 5xx	Transient ALB / Cloudflare failure	Wait 5 minutes and retry
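
The disk row is the one most worth checking mechanically. The sketch below extracts the integer Use% from `df -P /` (POSIX output format, capacity in the fifth column) and compares it to the 90% threshold used by this runbook. The function names `root_use_pct` and `disk_needs_cleanup` are hypothetical.

```shell
#!/bin/sh
# Reads `df -P <path>` output on stdin and prints the Use% of the last data row
# as a bare integer (for example 94).
root_use_pct() {
  awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# Succeeds when usage is at or above the threshold.
# $1 = Use% integer, $2 = threshold (defaults to 90)
disk_needs_cleanup() {
  [ "$1" -ge "${2:-90}" ]
}

# Usage:
# pct=$(ssh semugpt-aws 'df -P /' | root_use_pct)
# if disk_needs_cleanup "$pct"; then echo "disk at ${pct}%: run disk cleanup"; fi
```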

Recovery commands¶

Restart the backend (most common case)¶

ssh semugpt-aws 'sudo systemctl restart semugpt-backend'
sleep 90   # JVM startup takes 60-90 seconds
ssh semugpt-aws 'curl -s http://localhost:8080/actuator/health'

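With Restart=on-failure and RestartSec=30s configured, systemd also restarts the service on its own after a 30-second throttle, but the root cause still needs to be identified. This policy only reacts to a process that exits abnormally; a hung JVM that stays active is not restarted and needs the manual restart above.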
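
Because JVM startup takes anywhere from 60 to 90 seconds, polling can return earlier than a fixed `sleep 90` and also reports when the service never comes up. A minimal sketch follows; the function name `wait_until` and the 12 x 10 second budget in the usage line are assumptions.

```shell
#!/bin/sh
# Retries a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still failing after $attempts attempt(s)" >&2
  return 1
}

# Usage: curl -f exits non-zero on HTTP 4xx/5xx, so a 503 DOWN counts as "not ready".
# ssh semugpt-aws 'sudo systemctl restart semugpt-backend'
# wait_until 12 10 ssh semugpt-aws 'curl -sf http://localhost:8080/actuator/health'
```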

Restart infra containers (mysql / es / redis)¶

ssh semugpt-aws 'cd /opt/semugpt-2026 && sudo docker compose -f docker-compose.dev.yml restart'

Afterwards, restart the backend as well, because the DB connection pool needs to be re-established.

Restart Cloudflare Tunnel¶

Use when 502 is visible externally but localhost:8080 is healthy.

ssh semugpt-aws 'sudo systemctl restart cloudflared'

Last resort: reboot the Lightsail instance¶

Use only when SSH itself does not respond.

aws --profile ob lightsail reboot-instance --instance-name semugpt-backend
# Retry after 2-3 minutes

If the SSO token has expired, run aws sso login --profile ob first.
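
An expired token otherwise surfaces only as a failed reboot call in the middle of an incident. The sketch below checks the session first with `aws sts get-caller-identity`, which fails when credentials are invalid. The wrapper name `reboot_backend_instance` is hypothetical; the profile and instance name are taken from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper: verify the SSO session, then request the reboot.
# $1 = AWS profile (defaults to ob), $2 = Lightsail instance name
reboot_backend_instance() {
  profile=${1:-ob}
  instance=${2:-semugpt-backend}

  # get-caller-identity needs no special permissions and fails on expired credentials.
  if ! aws --profile "$profile" sts get-caller-identity >/dev/null 2>&1; then
    echo "SSO session invalid; run: aws sso login --profile $profile" >&2
    return 1
  fi

  aws --profile "$profile" lightsail reboot-instance --instance-name "$instance"
}

# Usage:
# reboot_backend_instance          # uses profile ob and instance semugpt-backend
```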

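Disk cleanup (Use% ≥ 90%)¶

The most common culprits are /var/log/syslog.1 (p6spy SQL log spam) and ~/.gradle/caches (build cache). The root cause has been permanently fixed with the systemd drop-in quiet-logs.conf, but the disk can still fill up for other reasons, so this cleanup runbook is kept.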

ssh semugpt-aws '
  sudo truncate -s 0 /var/log/syslog.1                                       # usually the largest (tens of GB)
  sudo find /home/ubuntu/.gradle/caches -mindepth 1 -delete 2>/dev/null      # Gradle dependency cache (~20GB)
  sudo docker system prune -af --filter "until=72h"                          # unused Docker images, containers, and build cache older than 72h
  df -h /
  sudo systemctl restart semugpt-backend
'

~/.gradle/wrapper is left in place: deleting it forces a re-download of Gradle 7.6 on restart (adds 5-10 minutes).
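
When the disk fills up for a reason other than the two known culprits, it helps to see what is large before deleting anything. The sketch below lists the biggest entries one level under a directory; the function name `largest_entries` is hypothetical, and sizes are in KiB.

```shell
#!/bin/sh
# Prints the N largest entries one level under a directory, largest first.
# The first row is normally the directory's own total.
# $1 = directory, $2 = number of rows (defaults to 10)
largest_entries() {
  dir=$1
  count=${2:-10}
  # -x stays on one filesystem, -k reports KiB, -d 1 limits depth to direct children
  du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage on the host (sudo is needed to read all of /var):
# ssh semugpt-aws 'sudo du -x -k -d 1 /var | sort -rn | head -n 10'
```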

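For detailed monitoring and threshold changes, see the monitoring page.

Automatic recovery layers¶

Each layer has some degree of automatic recovery, so transient failures heal without user intervention.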

Layer	System	Automatic behavior
Cloudflare Tunnel	Lightsail systemd `cloudflared`	systemd `Restart=` policy (unit default)
Spring Boot	Lightsail systemd `semugpt-backend`	`Restart=on-failure`, `RestartSec=30s`, throttle 30s
Docker infra	Docker engine	`restart: unless-stopped` (all of mysql/es/redis)
AWS Lightsail VM	AWS	infra HA: may restart automatically on instance crash ⚠️ (needs verification)

Boot order after a Lightsail reboot: docker → semugpt-backend → cloudflared. All three recover automatically, so no user intervention is needed.
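
These guarantees hold only while each unit is enabled at boot and carries a restart policy, so it can be worth confirming them on the host. A minimal sketch using `systemctl is-enabled` and `systemctl show`; the function name `unit_recovery_status` and its output format are hypothetical.

```shell
#!/bin/sh
# Prints one line per unit: whether it starts at boot and which Restart= policy it has.
unit_recovery_status() {
  for unit in "$@"; do
    # is-enabled exits non-zero for disabled units, so tolerate failure here.
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null || true)
    restart=$(systemctl show -p Restart --value "$unit" 2>/dev/null || true)
    echo "$unit enabled=${enabled:-unknown} restart=${restart:-unknown}"
  done
}

# Usage on the host:
# unit_recovery_status docker semugpt-backend cloudflared
#
# Restart policy of the compose-managed containers:
# sudo docker inspect -f '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' \
#   $(sudo docker compose -f /opt/semugpt-2026/docker-compose.dev.yml ps -q)
```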

When user intervention is needed¶

Cases that automatic recovery cannot catch.

Scenario	Automatic recovery	What a human must do
Disk full	❌ No	Run the "Disk cleanup" runbook
Docker volume corruption (ES index corruption, etc.)	❌ No	Recreate the volume + reindex
RDS down (prod)	❌ Depends on AWS	Check the AWS Service Health Dashboard
Cloudflare-side outage	❌ Depends on Cloudflare	Check status.cloudflare.com
systemd start fails due to a Gradle build failure	❌ Code problem	Check the stack trace with `journalctl -u semugpt-backend -f`, then fix the code
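
Following the journal live with `-f` is useful during a restart, but for a start failure that already happened a filtered snapshot is often quicker to read. The sketch below keeps only exception headlines and their `Caused by:` chain. The function name `exception_summary` and the 500-line window are assumptions.

```shell
#!/bin/sh
# Reads log text on stdin and prints only lines that name an exception or error
# class followed by a colon, plus "Caused by:" lines.
exception_summary() {
  grep -E 'Caused by:|[A-Za-z.]+(Exception|Error):'
}

# Usage: -o cat prints the bare message without the timestamp/host prefix.
# ssh semugpt-aws 'sudo journalctl -u semugpt-backend -o cat -n 500 --no-pager' \
#   | exception_summary
```

Note that `grep` exits with status 1 when nothing matches, which here simply means no exception lines were found in the window.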

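Stopping the dev backend (for local development)¶

During focused development you can temporarily shut down the Lightsail backend and run the backend on a local Mac. The dev URL returns 502 in the meantime, so avoid client demo times.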

ssh semugpt-aws 'sudo systemctl stop semugpt-backend'

# Work locally
cd apps/backend
./gradlew bootRun --args='--spring.profiles.active=local'

# Always restore when finished
ssh semugpt-aws 'sudo systemctl start semugpt-backend'
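
The restore step is easy to forget, especially when the local run ends with Ctrl-C. One option is to wrap the session so the remote start always runs on exit. This is a sketch under that assumption; the function name `with_remote_backend_stopped` is hypothetical.

```shell
#!/bin/sh
# Stops the remote backend, runs the given local command, and starts the remote
# backend again when the command ends, fails, or is interrupted.
# The body is a subshell so the traps do not leak into the calling shell.
with_remote_backend_stopped() (
  trap 'ssh semugpt-aws "sudo systemctl start semugpt-backend"' EXIT
  trap 'exit 130' INT TERM   # convert signals into a normal exit so EXIT fires
  ssh semugpt-aws 'sudo systemctl stop semugpt-backend' || exit 1
  "$@"
)

# Usage (from apps/backend):
# with_remote_backend_stopped ./gradlew bootRun --args='--spring.profiles.active=local'
```

If the stop command itself fails, the exit trap still issues a start, which is harmless for a service that is already running.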