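Digital Archive System Stability: A Hands-On Guide to Building a High-Availability Architecture from Scratch
1. Core Stability Metrics and Monitoring
Before building a digital archive system, first define its stability metrics. Four core metrics are monitored; the subsections below cover availability and performance.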
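An availability target is often easier to reason about as a downtime budget. The sketch below is an illustration rather than part of the original setup: it converts a target percentage into the downtime allowed over a period. The function name `downtime_budget_minutes` and the 30-day default period are assumptions.

```bash
#!/bin/sh
# Hypothetical helper: convert an availability target (percent) into the
# allowed downtime, in minutes, over a period of N days (default: 30).
downtime_budget_minutes() {
    target="$1"
    days="${2:-30}"
    awk -v t="$target" -v d="$days" 'BEGIN {
        total = d * 24 * 60                  # minutes in the period
        printf "%.1f\n", total * (100 - t) / 100
    }'
}

downtime_budget_minutes 99.9      # budget for a 30-day period at 99.9%
downtime_budget_minutes 99.99     # budget for a 30-day period at 99.99%
```

With a check every 5 minutes, as in section 1.1, a budget of a few minutes per month is smaller than the check interval, so a tighter target would need more frequent checks to be measurable.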
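1.1 Availability monitoring
Create a monitoring script that checks the service status every 5 minutes:
```bash
#!/bin/bash
# /opt/monitor/availability_check.sh
TIMESTAMP=$(date +%s)
# --max-time keeps a hung service from blocking the check;
# curl reports 000 when no HTTP response arrives
STATUS_CODE=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost:8080/health)
if [ "$STATUS_CODE" -eq 200 ]; then
    # Healthy: log timestamp,1
    echo "$TIMESTAMP,1" >> /var/log/availability.log
else
    # Unhealthy: log timestamp,0 and restart the service
    echo "$TIMESTAMP,0" >> /var/log/availability.log
    systemctl restart archive-service
fi
```
Add the script to crontab: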
```bash
# Run the availability check every 5 minutes
*/5 * * * * /opt/monitor/availability_check.sh
```
1.2 Performance monitoring configuration
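Install the Prometheus monitoring system:
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
# The archive unpacks into a directory named after the full release file
cd prometheus-2.45.0.linux-amd64
```
Create the configuration file: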
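```yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'archive-system'
    static_configs:
      # 9090 is Prometheus itself; 8080 is the application, which must
      # expose metrics at the default /metrics path for this to work
      - targets: ['localhost:9090', 'localhost:8080']
```
2. High-Availability Database Deployment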
2.1 PostgreSQL primary-standby replication
Configure the primary server:
```bash
# /etc/postgresql/14/main/postgresql.conf
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
```
Create the replication user:
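```sql
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'your_secure_password';
```
On the standby server, once its data directory has been seeded from the primary (for example with pg_basebackup), configure the recovery files: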
```bash
# /var/lib/postgresql/14/main/standby.signal
# Empty file; its presence marks this server as a standby

# /var/lib/postgresql/14/main/postgresql.auto.conf
primary_conninfo = 'host=<primary-server-IP> port=5432 user=replicator password=your_secure_password'
```
2.2 Automatic failover setup
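Patroni, set up in this section, keeps its leader lock in a distributed configuration store, and the configuration lists three etcd hosts. etcd is Raft-based and needs a majority of its members to be reachable. The quorum arithmetic can be sketched as follows; the function names are illustrative.

```bash
#!/bin/sh
# Majority needed for a cluster of N members: floor(N/2) + 1
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the remaining ones still form a majority
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Three members tolerate one failure, while two members tolerate none, which is why an odd member count is the usual choice.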
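Install Patroni to provide automatic failover:
```bash
# Quote the extra so the shell does not treat the brackets as a glob pattern
pip3 install 'patroni[etcd]'
```
Create the Patroni configuration file: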
```yaml
# /etc/patroni.yml
scope: archive-cluster
name: node1
restapi:
  listen: 0.0.0.0:8008
  connect_address: <local-IP>:8008
etcd:
  hosts: ["etcd1:2379", "etcd2:2379", "etcd3:2379"]
postgresql:
  listen: 0.0.0.0:5432
  connect_address: <local-IP>:5432
  data_dir: /var/lib/postgresql/14/main
  parameters:
    max_connections: 200
    shared_buffers: 1GB
```
3. Stateless Application Deployment
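3.1 Docker containerization
Create the Dockerfile:
```dockerfile
FROM openjdk:11-jre-slim
WORKDIR /app
# The docker-compose healthcheck calls curl, which slim base images
# typically do not include
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
COPY target/archive-system.jar app.jar
COPY config/application.yml config/
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar", "--spring.config.location=file:/app/config/application.yml"]
```
Create the docker-compose file: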
```yaml
version: '3.8'
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=production
      - DB_HOST=${DB_HOST}
    volumes:
      - ./logs:/app/logs
      - ./uploads:/app/uploads
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
3.2 Nginx load balancing configuration
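Configure load balancing and health checks. The `check`, `check_http_send`, and `check_http_expect_alive` directives come from the third-party nginx_upstream_check_module (also bundled with Tengine); stock Nginx does not provide them:
```nginx
upstream archive_backend {
    least_conn;
    server 192.168.1.101:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.102:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.103:8080 max_fails=3 fail_timeout=30s;

    # Active health checks (require nginx_upstream_check_module)
    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    server_name archive.example.com;

    location / {
        proxy_pass http://archive_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
4. High-Availability File Storage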
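MinIO's distributed mode, deployed in 4.1 with 4 nodes of 4 drives each, protects data with erasure coding rather than full copies, so usable capacity depends on the parity setting. The sketch below works through that arithmetic. The parity of 4, the 1000 GB drive size, and the assumption that all 16 drives form a single erasure set are illustrative; the actual parity depends on the MinIO version and its storage-class configuration.

```bash
#!/bin/sh
# usable_capacity_gb DRIVES DRIVE_SIZE_GB PARITY
# Usable capacity of one erasure set: (drives - parity) * drive size
usable_capacity_gb() {
    drives="$1"; size_gb="$2"; parity="$3"
    echo $(( (drives - parity) * size_gb ))
}

# storage_efficiency_pct DRIVES PARITY
# Share of raw capacity that holds data rather than parity
storage_efficiency_pct() {
    awk -v d="$1" -v p="$2" 'BEGIN { printf "%.0f\n", (d - p) * 100 / d }'
}

# 4 nodes x 4 drives = 16 drives, assumed parity of 4
echo "usable GB:    $(usable_capacity_gb 16 1000 4)"
echo "efficiency %: $(storage_efficiency_pct 16 4)"
```

Raising the parity trades usable capacity for tolerance of more simultaneous drive failures.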
4.1 MinIO distributed storage deployment
Deploy a 4-node MinIO cluster:
```bash
# Run on every node
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
# bash expands {1..4} into data1 ... data4 before minio sees the arguments
MINIO_ROOT_USER=admin MINIO_ROOT_PASSWORD=your_password ./minio server \
    http://node1/data{1..4} \
    http://node2/data{1..4} \
    http://node3/data{1..4} \
    http://node4/data{1..4}
```
4.2 File synchronization strategy

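Create the file sync script:
```bash
#!/bin/bash
# /opt/sync/archive_sync.sh
SOURCE_DIR="/data/uploads"
BUCKET="archive-documents"
# mc addresses targets as ALIAS/BUCKET rather than by URL, so register the
# cluster under an alias first (credentials are read from the environment here)
MINIO_ALIAS="minio-cluster"
/usr/local/bin/mc alias set "$MINIO_ALIAS" http://minio-cluster:9000 \
    "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"

# Mirror with mc and keep watching for changes
/usr/local/bin/mc mirror --watch --overwrite "$SOURCE_DIR" "$MINIO_ALIAS/$BUCKET"
```
Configure real-time monitoring with inotify: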
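```bash
apt-get install inotify-tools

# Create the watch script
cat > /opt/sync/watch_files.sh << 'EOF'
#!/bin/bash
inotifywait -m -r -e create -e modify -e move -e delete /data/uploads |
while read -r path action file; do
    # Deleted or moved-away files no longer exist, so copy only what is still present
    if [ -f "$path$file" ]; then
        mc cp "$path$file" minio-cluster/archive-documents/
    fi
done
EOF
```
5. Data Backup and Recovery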
5.1 Full backup strategy
Create the full database backup script:
```bash
#!/bin/bash
# /opt/backup/full_backup.sh
BACKUP_DIR="/backup/full"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"

# Full PostgreSQL backup
pg_dumpall -h localhost -U postgres | gzip > "$BACKUP_DIR/db_full_$DATE.sql.gz"

# File backup
tar -czf "$BACKUP_DIR/files_$DATE.tar.gz" /data/uploads

# Keep only the last 7 days of backups
find "$BACKUP_DIR" -type f -mtime +7 -delete
```
5.2 Incremental backup
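The `archive_command` configured in this section copies each finished WAL segment into /backup/wal with a plain `cp`. PostgreSQL's documentation advises that an archive command should refuse to overwrite an existing file and should return a non-zero status on failure. A minimal sketch of such a guard follows; the function name `archive_wal` is hypothetical, and the default directory is the one used in the text.

```bash
#!/bin/sh
# archive_wal SRC_PATH FILE_NAME [ARCHIVE_DIR]
# Copies one WAL segment into the archive without ever overwriting a file.
archive_wal() {
    src="$1"
    name="$2"
    dest_dir="${3:-/backup/wal}"

    # An existing file with the same name is treated as an error
    if [ -f "$dest_dir/$name" ]; then
        echo "refusing to overwrite $dest_dir/$name" >&2
        return 1
    fi

    # Copy under a temporary name, then rename, so a partial copy is never
    # visible under the final segment name
    cp "$src" "$dest_dir/$name.tmp" && mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}
```

Wrapped in an executable script that calls `archive_wal "$1" "$2"`, it could be referenced from `archive_command` with `%p` and `%f` as its two arguments; the script path is left to the deployment.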
Configure WAL archiving:
```bash
# postgresql.conf settings
archive_mode = on
archive_command = 'cp %p /backup/wal/%f'
```
Create the restore script for incremental backups:
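```bash
#!/bin/bash
# /opt/backup/restore.sh
# Pick the most recent full dump
LAST_BACKUP=$(ls -t /backup/full/db_full_*.sql.gz | head -1)

# Restore the base (logical) dump
gunzip -c "$LAST_BACKUP" | psql -U postgres

# Note: archived WAL cannot be replayed on top of a pg_dumpall dump.
# Point-in-time recovery needs a physical base backup (e.g. pg_basebackup)
# plus a restore_command. pg_archivecleanup only prunes archived segments
# older than the one named here; it does not apply WAL.
pg_archivecleanup /backup/wal/ 000000010000000000000001
```
6. Stress Testing and Performance Tuning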
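6.1 Running the stress test
Use wrk for stress testing:
```bash
# Install wrk
git clone https://github.com/wg/wrk.git
cd wrk && make

# Run the test: 12 threads, 400 connections, 30 seconds
./wrk -t12 -c400 -d30s --latency http://archive.example.com/api/documents
```
Create the test report script: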
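```bash
#!/bin/bash
# /opt/test/performance_test.sh
echo "Starting stress test..."
./wrk -t12 -c400 -d60s --latency http://localhost:8080/api/documents > test_result.txt
echo "Test results:"
cat test_result.txt

# Extract key metrics. With --latency, wrk also prints a "Latency Distribution"
# heading, so match only the thread-stats line.
LATENCY=$(awk '$1 == "Latency" && $2 != "Distribution" {print $2}' test_result.txt)
REQUESTS=$(awk '$1 == "Requests/sec:" {print $2}' test_result.txt)
echo "Average latency: $LATENCY"
echo "Requests per second: $REQUESTS"
```
6.2 JVM performance tuning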
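Configure the JVM parameters:
```bash
# Application startup parameters
# (UseContainerSupport is already on by default in JDK 10 and later;
# it is listed explicitly here)
java -Xms2g -Xmx2g -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=200 \
    -XX:+UnlockExperimentalVMOptions \
    -XX:+UseContainerSupport \
    -jar archive-system.jar
```
7. Log Collection and Alerting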
7.1 ELK log collection
Install Filebeat:
```bash
wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.10.0-linux-x86_64.tar.gz
tar xzvf filebeat-8.10.0-linux-x86_64.tar.gz
```
Configure Filebeat:
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /app/logs/*.log
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "archive-logs-%{+yyyy.MM.dd}"
```
7.2 Alert rule configuration
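The HighErrorRate rule in this section is described as firing when 5xx errors exceed 10% of traffic. As a worked example of that arithmetic outside Prometheus, the sketch below compares an error ratio against a threshold, given two per-second rates. The function name and the sample numbers are illustrative.

```bash
#!/bin/sh
# error_ratio_exceeds ERR_PER_SEC TOTAL_PER_SEC THRESHOLD
# Succeeds (exit status 0) when errors / total is strictly above the threshold.
error_ratio_exceeds() {
    awk -v e="$1" -v t="$2" -v th="$3" 'BEGIN {
        if (t <= 0) exit 1              # no traffic: nothing to alert on
        exit !((e / t) > th)
    }'
}

# 0.5 errors/s out of 100 requests/s is a 0.5% error ratio
if error_ratio_exceeds 0.5 100 0.1; then echo "alert"; else echo "ok"; fi
# 15 errors/s out of 100 requests/s is a 15% error ratio
if error_ratio_exceeds 15 100 0.1; then echo "alert"; else echo "ok"; fi
```

A threshold on the absolute 5xx rate alone would treat the first case (0.5 errors per second) as above 0.1, even though it is only 0.5% of traffic, so a percentage threshold needs the ratio of the two rates.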
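Configure the Prometheus alert rules:
```yaml
# alert_rules.yml
groups:
  - name: archive_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the 0.1 threshold
        # matches the 10% stated in the description
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "5xx error rate exceeds 10%"
      - alert: ServiceDown
        expr: up{job="archive-system"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service down"
          description: "{{ $labels.instance }} is unavailable"
```
8. Disaster Recovery Drills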
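8.1 Failover drill script
Create the database failover test script:
```bash
#!/bin/bash
# /opt/drill/failover_test.sh
STANDBY_IP="<standby-IP>"   # replace with the standby server's address

echo "1. Stopping the primary database..."
systemctl stop postgresql@14-main
echo "2. Waiting 30 seconds to simulate the outage..."
sleep 30
echo "3. Checking whether the standby was promoted to primary..."
# -A drops the alignment padding that psql -t leaves around the value
IN_RECOVERY=$(psql -h "$STANDBY_IP" -U postgres -tA -c "SELECT pg_is_in_recovery()")
if [ "$IN_RECOVERY" = "f" ]; then
    echo "✓ Failover succeeded"
else
    echo "✗ Failover failed"
    exit 1
fi
echo "4. Restoring the original primary..."
systemctl start postgresql@14-main
```
8.2 Data recovery validation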
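Create the data integrity validation script:
```bash
#!/bin/bash
# /opt/drill/data_validation.sh
echo "Validating data integrity..."
# -tA returns the bare count without header or padding
DB_COUNT=$(psql -U postgres -d archive_db -tA -c "SELECT COUNT(*) FROM documents")
FILE_COUNT=$(find /data/uploads -type f | wc -l)
echo "Database record count: $DB_COUNT"
echo "File count: $FILE_COUNT"
if [ "$DB_COUNT" -eq "$FILE_COUNT" ]; then
    echo "✓ Data integrity check passed"
else
    echo "✗ Data mismatch, investigation required"
    exit 1
fi
```
After completing the deployment steps above, the system has the following stability properties: automatic database failover, load-balanced application services, multi-replica file storage, and a complete monitoring and alerting setup. Run a disaster recovery drill once a month to confirm that every component behaves as expected during a real failure.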
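To track whether these properties hold between drills, the log written by the availability check in section 1.1 can be summarized. A minimal sketch, assuming the `timestamp,status` line format shown there (1 for a successful check, 0 for a failed one); the function name is illustrative.

```bash
#!/bin/sh
# availability_pct LOG_FILE [SINCE_EPOCH]
# Prints the percentage of successful checks recorded at or after
# SINCE_EPOCH (default 0, meaning the whole log).
availability_pct() {
    log="$1"
    since="${2:-0}"
    awk -F, -v since="$since" '
        $1 >= since { total++; up += $2 }
        END {
            if (total == 0) { print "no data"; exit 1 }
            printf "%.2f\n", up * 100 / total
        }' "$log"
}
```

For a 30-day window, the second argument could be computed as `$(( $(date +%s) - 30 * 24 * 3600 ))` and the result compared with the availability target.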