VerteX Air-Gapped Install on EKS
Self-hosted Palette VerteX deployment on AWS EKS in the Recro internal account. Air-gapped: private-only EKS API, no NAT, all images pulled from ECR via VPC endpoints.
The bastion is fully autonomous — apply Terraform, wait ~40 minutes, VerteX is running. No manual intervention.
Architecture
S3 Bundle Bucket (vertex-ecr module)
├── charts.zip, palette-vertex-appliance-4.8.40.tar.zst
├── index.json, oci-layout (OCI metadata)
├── Scripts (vertex-mirror-pull, ecr-mirror, cluster-setup, full-setup, dns-update)
└── Helm values (cert-manager, image-swap, vertex airgap configs)
ECR (vertex-ecr module)
└── 128 repos under vertex-bootstrap/* (350 container images)
vertex-bootstrap module
├── VPC 10.1.0.0/16 (private + public subnets, no NAT)
├── VPC Endpoints (ECR, S3, EKS, STS, logs, EC2, SSM)
├── EKS cluster (K8s 1.32, private-only, 3x m5.2xlarge)
├── Client VPN (mutual TLS, split tunnel, 10.250.0.0/22)
└── Bastion (t3.large, public subnet, SSM access)
On boot:
1. Install tools (kubectl, helm, oras, docker)
2. Download scripts + values from S3
3. Pull bundle from S3 + extract (17 GB)
4. Wait for EKS cluster ACTIVE
5. Wait for nodes Ready + create gp3 StorageClass
6. Check ECR → mirror if empty (~20 min first time, skip on rebuild)
7. Helm install cert-manager, image-swap, hubble
8. Update DNS (vertex.recrocog.com → NLB)
9. Export VPN .ovpn config to S3
10. VerteX running
Prerequisites (one-time)
1. Spectro Cloud credentials
Downloading the VerteX installer files from Artifact Studio requires the shared Spectro Cloud credentials. Do not commit them to the repository.
2. Download from Artifact Studio
Artifact Studio portal → Palette VerteX → Releases → 4.8.40
| File | Size | Purpose |
|---|---|---|
| charts.zip | ~390 KB | Helm charts (cert-manager, image-swap, VerteX mgmt plane) |
| palette-vertex-appliance-4.8.40.tar.zst | ~17 GB | Container images + Spectro packs |
3. Extract OCI metadata from the bundle
The tar.zst contains index.json, index.json.lock, and oci-layout that the mirror script needs. Extract them separately:
tar --use-compress-program=unzstd -xf palette-vertex-appliance-4.8.40.tar.zst \
-C /tmp index.json index.json.lock oci-layout
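Before uploading, it may be worth confirming that all three files actually landed in the target directory. A minimal sketch, assuming the extraction above wrote them to /tmp; check_oci_metadata is a hypothetical helper, not a script from the repo:

```shell
# Confirm the three OCI metadata files exist in a directory (default: /tmp).
# check_oci_metadata is an illustrative name, not part of recro-aws-iac.
check_oci_metadata() {
  dir="${1:-/tmp}"
  for f in index.json index.json.lock oci-layout; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  echo "OCI metadata present in $dir"
}
```

Calling `check_oci_metadata /tmp` prints a confirmation line, or names the first missing file and returns non-zero.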
4. AWS access
You need to be a member of the recro GitHub org with access to recro-aws-iac.
aws sso login --profile <your-profile>
git clone https://github.com/recro/recro-aws-iac.git
cd recro-aws-iac
5. Local tools
git, terraform 1.10+, aws CLI v2, aws-session-manager-plugin
First-Time Setup
Step 1: Apply vertex_ecr (S3 bucket + ECR repos)
cd terraform && make init
terraform apply \
-var="region=us-east-1" \
-var-file=sso.tfvars -var-file=dns.tfvars -var-file=eks.tfvars \
-var-file=cognito.tfvars -var-file=resource-cleanup.tfvars \
-var-file=resource-startup.tfvars -var-file=manual-cleanup-web-app.tfvars \
-var-file=vertex.tfvars \
-target='module.vertex_ecr[0]'
Step 2: Upload bundle files to S3
aws s3 cp charts.zip s3://891377028731-vertex-bootstrap-bundle/
aws s3 cp palette-vertex-appliance-4.8.40.tar.zst s3://891377028731-vertex-bootstrap-bundle/
# Upload OCI metadata (extracted in prerequisite step 3)
aws s3 cp /tmp/index.json s3://891377028731-vertex-bootstrap-bundle/
aws s3 cp /tmp/index.json.lock s3://891377028731-vertex-bootstrap-bundle/
aws s3 cp /tmp/oci-layout s3://891377028731-vertex-bootstrap-bundle/
The 17 GB upload takes 60-90 min on residential internet. This is a one-time step — the files persist in S3 across rebuilds.
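As a rough check on that estimate: 17 GB is about 136,000 megabits, so 60-90 minutes corresponds to a sustained uplink of roughly 25 to 38 Mbps. The sketch below does the same arithmetic for other link speeds. It uses decimal units and integer division, so results are rounded down; upload_minutes is an illustrative name.

```shell
# Estimate whole minutes to upload SIZE_GB over an uplink of MBPS.
# Decimal units: 1 GB = 8000 megabits. Integer arithmetic truncates.
upload_minutes() {
  size_gb="$1"
  mbps="$2"
  echo $(( size_gb * 8000 / mbps / 60 ))
}
```

For example, `upload_minutes 17 25` gives the slow end of the range and `upload_minutes 17 38` the fast end.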
Step 3: Apply vertex_bootstrap
Via CI (recommended):
Tag must contain "vertex" to trigger the vertex-scoped CI apply. ~30-40 min total.
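A minimal sketch of creating such a tag, following the `<date>-vertex` pattern used elsewhere in this document. The remote name (origin), the tag message, and the assumption that CI triggers on pushed tags are not confirmed by this page; both function names are illustrative.

```shell
# Build today's vertex-scoped tag name: UTC date plus the "-vertex" suffix.
vertex_tag() {
  printf '%s-vertex\n' "$(date -u +%Y-%m-%d)"
}

# Create and push the tag. Assumptions: the remote is named "origin" and the
# tag message is free-form.
push_vertex_tag() {
  tag="$(vertex_tag)"
  git tag -a "$tag" -m "Apply vertex" && git push origin "$tag"
}
```

Run `vertex_tag` first to see the name that would be used, then `push_vertex_tag` from the repository root.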
Or locally:
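A sketch of the local apply, assuming it takes the same -var and -var-file set as Step 1 with only the -target changed to the bootstrap module (the target name matches the one used in the destroy command). By default the function only prints the command; set RUN=1 to execute it. apply_vertex_bootstrap is an illustrative name.

```shell
# Print (default) or run the local apply for the bootstrap module.
# Assumption: same variable files as Step 1; only -target differs.
# Run from the terraform/ directory after `make init`.
apply_vertex_bootstrap() {
  set -- terraform apply \
    -var="region=us-east-1" \
    -var-file=sso.tfvars -var-file=dns.tfvars -var-file=eks.tfvars \
    -var-file=cognito.tfvars -var-file=resource-cleanup.tfvars \
    -var-file=resource-startup.tfvars -var-file=manual-cleanup-web-app.tfvars \
    -var-file=vertex.tfvars \
    -target='module.vertex_bootstrap[0]'
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}
```

The printed form drops shell quoting, so use it for inspection rather than copy-paste.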
Step 4: Wait
The bastion does everything automatically:
- Install tools (~3 min)
- Download scripts + values from S3 (~10 sec)
- Pull 17 GB bundle from S3 + extract (~10 min)
- Wait for EKS cluster ACTIVE (~12 min, overlaps with bastion boot)
- Create gp3 StorageClass (~5 sec)
- Mirror ~350 images to ECR (~20 min, first time only — skipped if ECR already populated)
- Push Spectro packs to ECR (~2 min)
- Helm install cert-manager + image-swap (~5 min)
- Restore K8s Secrets from latest.tgz (~30 sec, fresh install only); preserves the master encryption key
- Helm install hubble (~8-10 min)
- Restore mongo from latest.tgz (~1 min, fresh install only); all prior data comes back: activation, System ID, users, tenants
- Refresh CoreDNS with current VPC endpoint IPs
- Update DNS vertex.recrocog.com → NLB (public + private Route53 zones)
- Export fresh .ovpn to S3
Total first-time (empty ECR, no backup): ~40 min. Rebuild (ECR pre-populated, with backup): ~25 min — cluster comes back with all prior data, no manual intervention.
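For reference, the gp3 StorageClass step amounts to applying a small manifest. The sketch below writes one plausible version. It assumes the EBS CSI driver (provisioner ebs.csi.aws.com) is installed on the cluster; the manifest that vertex-cluster-setup.sh really applies may differ. write_gp3_storageclass and the output path are illustrative.

```shell
# Write a default gp3 StorageClass manifest to a file.
# Assumptions: EBS CSI driver is installed; the real script may use other fields.
write_gp3_storageclass() {
  out="${1:-gp3-storageclass.yaml}"
  cat > "$out" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
EOF
}
```

After writing the file, `kubectl apply -f gp3-storageclass.yaml` would create the class, and `kubectl get sc` should then show gp3 marked as default.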
Step 5: Access VerteX console via VPN
The bastion auto-exports a .ovpn config file to S3 after each rebuild. Download it and connect.
First time — install AWS VPN Client:
Download from aws.amazon.com/vpn/client-vpn-download
Download .ovpn config (re-download after each rebuild):
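The download is the same command listed in the troubleshooting table. A minimal sketch, wrapped in a function so the bucket path lives in one place; download_ovpn is an illustrative name, and an active SSO session is assumed.

```shell
# Bucket and key as listed in the Quick Reference table.
OVPN_URI="s3://891377028731-vertex-bootstrap-bundle/vertex-vpn.ovpn"

# Fetch the current VPN profile into the working directory.
download_ovpn() {
  aws s3 cp "$OVPN_URI" ./vertex-vpn.ovpn
}
```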
Connect:
- Open AWS VPN Client
- File → Manage Profiles → Add Profile → select vertex-vpn.ovpn
- Click Connect
- Browse https://vertex.recrocog.com/system
Login: user admin; you set the password yourself (14+ characters, including upper, lower, digit, and special)
kubectl also works directly while connected:
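A sketch of pointing kubectl at the cluster, using the region and cluster name from the Quick Reference table. It assumes your AWS profile is allowed to describe the cluster and that the VPN is connected, since the API endpoint is private-only. vertex_kubeconfig is an illustrative name.

```shell
# Add a kubeconfig entry for the vertex-bootstrap cluster.
# Requires an active VPN connection before kubectl can reach the API.
vertex_kubeconfig() {
  aws eks update-kubeconfig --region us-east-1 --name vertex-bootstrap
}
```

Typical use is `vertex_kubeconfig && kubectl get nodes`.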
Fallback — SSM tunnel (if VPN is unavailable):
BASTION=$(aws ec2 describe-instances --region us-east-1 \
--filters "Name=tag:Name,Values=vertex-bootstrap-bastion" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[0].Instances[0].InstanceId' --output text)
aws ssm start-session --region us-east-1 --target "$BASTION" \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["<NLB-hostname>"],"portNumber":["443"],"localPortNumber":["8443"]}'
Browse: https://localhost:8443/system
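To avoid hand-editing the JSON in the --parameters argument, a small helper can build it from the NLB hostname. This is a sketch; ssm_forward_params is a hypothetical helper, and the ports default to the ones used above.

```shell
# Build the --parameters JSON for AWS-StartPortForwardingSessionToRemoteHost.
# Arguments: NLB hostname, remote port (default 443), local port (default 8443).
ssm_forward_params() {
  host="$1"
  remote_port="${2:-443}"
  local_port="${3:-8443}"
  printf '{"host":["%s"],"portNumber":["%s"],"localPortNumber":["%s"]}\n' \
    "$host" "$remote_port" "$local_port"
}
```

The result can be passed as `--parameters "$(ssm_forward_params <NLB-hostname>)"` in the start-session command.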
Destroy / Rebuild
Destroy (save costs)
git tag <date>-vertex -m "Destroy"
# Or locally:
terraform destroy -target='module.vertex_bootstrap[0]' -auto-approve ...
Destroys: VPC, EKS, bastion, VPN, all K8s workloads
Keeps: ECR repos (350 images), S3 bucket (bundle + scripts), KMS keys, ACM certs
Rebuild
The bastion boots, skips the ECR mirror (images are already there), runs the Helm installs, and VerteX is up in ~25 min. DNS auto-updates. A new .ovpn is uploaded to S3; re-download it and reconnect.
Monitoring & Troubleshooting
Check bastion logs
aws ssm start-session --region us-east-1 --target <bastion-id>
sudo -i
cat /var/log/vertex-full-setup.log # Full orchestrator
cat /var/log/vertex-ecr-mirror.log # ECR mirror
cat /var/log/vertex-cluster-setup.log # Node wait + StorageClass
cat /var/log/vertex-dns-update.log # DNS update
cat /var/log/vertex-bastion-bootstrap.log # Tool versions
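Rather than reading each log in full, a quick scan for error lines can narrow things down. The sketch assumes failures are logged with the word ERROR, as in the `[boot] ERROR: Failed to download` symptom listed under common issues; scan_vertex_logs is an illustrative name.

```shell
# Print lines containing "ERROR" from vertex-*.log files, prefixed by file name.
# Returns non-zero when no line matches.
scan_vertex_logs() {
  log_dir="${1:-/var/log}"
  grep -H "ERROR" "$log_dir"/vertex-*.log
}
```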
Re-run orchestrator
Idempotent — skips steps already completed.
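On the bastion the re-run is presumably `sudo /usr/local/bin/vertex-full-setup.sh`, given where the other scripts live. How the script decides a step is already done is not described here. One common pattern is a marker file per step, sketched below; STATE_DIR and run_step are hypothetical names, and the real orchestrator may instead inspect live state such as ECR contents or Helm releases.

```shell
# Run a named step once; later calls are skipped while its marker file exists.
# STATE_DIR and run_step are illustrative, not taken from vertex-full-setup.sh.
STATE_DIR="${STATE_DIR:-/tmp/vertex-setup-state}"

run_step() {
  name="$1"
  shift
  mkdir -p "$STATE_DIR"
  if [ -f "$STATE_DIR/$name.done" ]; then
    echo "skip $name"
    return 0
  fi
  # The marker is written only when the step's command succeeds.
  "$@" && touch "$STATE_DIR/$name.done"
}
```

A failed step leaves no marker, so the next run retries it while completed steps are skipped.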
Common issues
| Issue | Symptom | Fix |
|---|---|---|
| Scripts missing from S3 | [boot] ERROR: Failed to download | terraform apply -target='module.vertex_ecr[0]' |
| Cluster not ready | Cluster status is CREATING | Wait; the orchestrator retries every 30s for 20 min |
| Image pull failures | Pods ImagePullBackOff | Check ECR mirror log, re-run sudo /usr/local/bin/vertex-ecr-mirror.sh |
| MongoDB PVCs pending | PVCs stuck Pending | Check kubectl get sc; gp3 should be the default |
| ELB blocks destroy | DependencyViolation on subnet | Destroy provisioner handles this automatically (ELB + ENI + SG cleanup) |
| DNS stale | vertex.recrocog.com wrong NLB | sudo /usr/local/bin/vertex-dns-update.sh |
| VPN .ovpn stale | Can't connect after rebuild | Re-download: aws s3 cp s3://891377028731-vertex-bootstrap-bundle/vertex-vpn.ovpn . |
| VPN cert error | .ovpn missing cert/key | sudo /usr/local/bin/vertex-vpn-export.sh then re-download |
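The "retries every 30s for 20 min" behaviour in the cluster-not-ready row works out to 40 attempts. A minimal sketch of such a fixed-interval retry loop follows. It is not the orchestrator's actual code; retry and cluster_is_active are illustrative names.

```shell
# Retry a command at a fixed interval until it succeeds or attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds once the EKS control plane reports ACTIVE.
cluster_is_active() {
  status="$(aws eks describe-cluster --region us-east-1 --name vertex-bootstrap \
    --query 'cluster.status' --output text)"
  [ "$status" = "ACTIVE" ]
}
```

`retry 40 30 cluster_is_active` would reproduce the 20-minute window.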
Quick Reference
| Thing | Value |
|---|---|
| Region | us-east-1 |
| EKS cluster | vertex-bootstrap |
| EKS endpoint | Private only |
| VPC CIDR | 10.1.0.0/16 |
| ECR registry | 891377028731.dkr.ecr.us-east-1.amazonaws.com |
| ECR repo prefix | vertex-bootstrap/ |
| S3 bundle bucket | s3://891377028731-vertex-bootstrap-bundle/ |
| Root domain | vertex.recrocog.com |
| VerteX version | 4.8.40 |
| Node sizing | 3x m5.2xlarge (8 vCPU, 32 GB RAM, 110 GB disk) |
| VPN client CIDR | 10.250.0.0/22 |
| VPN auth | Mutual TLS (certs in ACM) |
| VPN .ovpn | s3://891377028731-vertex-bootstrap-bundle/vertex-vpn.ovpn |
| Spectro Artifact Studio | Shared credentials (see prerequisite 1; do not commit) |
CI Tag Convention
| Tag | Scope |
|---|---|
| <date> (e.g., 2026-04-17) | Core infra only (SSO, DNS, recro-eks); no vertex |
| <date>-vertex (e.g., 2026-04-17-vertex) | Vertex only (vertex_ecr + vertex_bootstrap) |
When adding a new module to main.tf, also add a matching -target line to plan-core in the Makefile; otherwise non-vertex CI applies will skip it.
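The scoping rule in the table is a substring match on the tag name. A minimal sketch of that rule, assuming the match is case-sensitive (the CI configuration itself is not shown here); tag_scope is an illustrative name.

```shell
# Classify a CI tag: any tag containing "vertex" is vertex-scoped,
# everything else is treated as core infra.
tag_scope() {
  case "$1" in
    *vertex*) echo "vertex" ;;
    *)        echo "core" ;;
  esac
}
```

For the table's examples, `tag_scope 2026-04-17` prints `core` and `tag_scope 2026-04-17-vertex` prints `vertex`.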
Scripts Reference
All scripts are Terraform-managed S3 objects (terraform/modules/vertex-ecr/scripts/). The bastion downloads them on boot.
Bastion-side (run from /usr/local/bin/ on bastion)
| Script | Purpose | Auto on boot? |
|---|---|---|
| vertex-mirror-pull.sh | Pull bundle + values from S3, extract | Yes |
| vertex-ecr-mirror.sh | Push OCI images to ECR | Yes (if ECR empty) |
| vertex-pack-push.sh | Push Spectro packs to ECR | Yes |
| vertex-cluster-setup.sh | Wait for nodes, create gp3 StorageClass | Yes |
| vertex-secrets-restore.sh | Pre-helm-install K8s Secrets restore (preserves master encryption key) | Yes (before helm install) |
| vertex-full-setup.sh | Orchestrator: runs all steps in order | Yes |
| vertex-restore-backup.sh | Post-helm mongo restore from s3://.../backups/latest.tgz | Yes (on fresh install only) |
| vertex-coredns-refresh.sh | CoreDNS VPC endpoint IP freshness check | Yes (on boot + hourly systemd timer) |
| vertex-dns-update.sh | UPSERT vertex.recrocog.com in Route53 public + private zones | Yes (after Helm) |
| vertex-vpn-export.sh | Export .ovpn config to S3 | Yes (after DNS) |
In-cluster CronJobs (installed by vertex-full-setup.sh)
| CronJob | Schedule | Purpose |
|---|---|---|
| vertex-mongo-backup | Daily 02:00 UTC | Dump mongo + K8s Secrets, rotate 7 daily / 4 weekly / 3 monthly, update latest.tgz |
| vertex-backup-verify | Sundays 04:00 UTC | Integrity check of latest.tgz (extracts, validates mongo archive + secrets), writes backups/last-verify.json |
Config: /usr/local/bin/vertex-mirror-config.sh (injected by Terraform user_data)
Backup & Restore
Automated backup (in-cluster CronJob)
The vertex-mongo-backup CronJob runs daily at 02:00 UTC and writes a bundle to S3:
s3://891377028731-vertex-bootstrap-bundle/backups/
├── latest.tgz # most recent, always overwritten
├── daily/<Mon|Tue|...|Sun>.tgz # 7-day rotation
├── weekly/week-<NN>.tgz # 4-week rotation
└── monthly/<YYYY-MM>.tgz # 3-month rotation
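To make the layout concrete, the sketch below prints the keys a backup taken on a given date could map to. It assumes GNU date and ISO week numbers for week-<NN>. The CronJob may number weeks differently or write the weekly and monthly copies only on certain days. backup_keys is an illustrative name.

```shell
# Print candidate S3 keys for a backup taken on DAY (YYYY-MM-DD, UTC).
# Assumptions: GNU date; week-<NN> uses the ISO week number (%V).
backup_keys() {
  day="$1"
  echo "backups/latest.tgz"
  echo "backups/daily/$(LC_ALL=C date -u -d "$day" +%a).tgz"
  echo "backups/weekly/week-$(date -u -d "$day" +%V).tgz"
  echo "backups/monthly/$(date -u -d "$day" +%Y-%m).tgz"
}
```

Because daily keys are weekday names, each one is overwritten a week later, which is what gives the 7-day rotation without an explicit delete.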
Each bundle contains:
- dump.archive — mongodump of the entire hubbledb replica set
- secrets.json — kubectl get secrets -n hubble-system (stripped of runtime fields), excluding spectro-mongodb-replicaset-key and spectro-mongo-key (these must be regenerated fresh by Spectro's mongodb-key-manager Job on every install)
The vertex-backup-verify CronJob runs weekly and writes backups/last-verify.json so stale or broken backups are detected before you need them.
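A similar spot check can be run by hand on a downloaded bundle. The sketch only confirms that the two members named above are listed in the archive; it does not validate the mongo dump itself. It assumes the members sit at the archive root, optionally with a "./" prefix. check_backup_bundle is an illustrative name.

```shell
# Check that a backup bundle lists both dump.archive and secrets.json.
# Assumption: members are at the archive root (with or without "./").
check_backup_bundle() {
  bundle="$1"
  listing="$(tar -tzf "$bundle")" || return 1
  for member in dump.archive secrets.json; do
    if ! printf '%s\n' "$listing" | grep -Eq "^(\./)?$member\$"; then
      echo "missing $member in $bundle" >&2
      return 1
    fi
  done
  echo "bundle OK: $bundle"
}
```

Typical use: download backups/latest.tgz from the bundle bucket, then run `check_backup_bundle latest.tgz`.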
Automated restore on fresh install
When the bastion runs vertex-full-setup.sh against an empty hubble-system namespace, it restores automatically in this order:
1. Secrets first (vertex-secrets-restore.sh, runs before helm install hubble): Pre-applies 30+ K8s Secrets from latest.tgz into the empty namespace. Helm's lookup template then reuses the existing configserversecret.secretEncryptKey / hashSalt / rootKey instead of generating new ones. This preserves the master encryption key so restored mongo {cipher}... values stay decryptable.
2. Helm install hubble: now installs with the preserved secrets.
3. Mongo restore (vertex-restore-backup.sh, runs after helm install when FRESH_INSTALL=true): Runs mongorestore from dump.archive in the same bundle. The restored data matches the preserved encryption key → no crashloop cascade.
On rebuild (destroy + apply), the S3 bucket survives, so latest.tgz is still there and the new cluster comes up with all prior data intact: activation, System ID, users, tenants, registries. No manual intervention.