# IMPROVEMENTS GUIDE: Ansible Proxmox VM Role
## Summary of Changes
This document outlines the improvements made to your Ansible role for robustness, maintainability, and best practices.
### What Was Improved
1. **Task Modularization** - Split monolithic tasks into 6 logical stages
2. **Error Handling** - Added `block`/`rescue` sections with recovery strategies
3. **Idempotency** - Ensured all operations are safe to re-run
4. **Pre-flight Validation** - Comprehensive environment checks before execution
5. **Documentation** - Extensive inline comments and variable documentation
6. **Logging** - Rich task names and debug output for troubleshooting
---
## File Structure
### New/Modified Files
```
tasks/
├─ main.yml # REFACTORED: Now orchestrates subtasks
├─ preflight-checks.yml # NEW: Environment validation
├─ download-image.yml # IMPROVED: Better error handling & caching
├─ create-vm.yml # IMPROVED: Idempotent VM creation
├─ configure-vm.yml # IMPROVED: Disk, Cloud-Init, TPM, GPU with error handling
├─ create-template.yml # IMPROVED: Idempotent template conversion
├─ create-clones.yml # IMPROVED: Clone creation with validation
└─ helpers.yml # NEW: Utility tasks for common operations
defaults/
└─ main.yml # IMPROVED: Complete documentation & new options
templates/
├─ cloudinit_userdata.yaml.j2 # No changes
└─ cloudinit_vendor.yaml.j2 # No changes
```
---
## 1. TASK MODULARIZATION
### Before
All tasks were in a single `main.yml` file (150+ lines), making it:
- Difficult to debug
- Hard to extend
- Not reusable
### After
Each stage has its own file:
| File | Purpose | Key Features |
|------|---------|--------------|
| `preflight-checks.yml` | Validate environment | Checks Proxmox, storage, SSH keys, IPs |
| `download-image.yml` | Get Debian image | Caching, retry logic, size verification |
| `create-vm.yml` | Create VM | Idempotent, error handling |
| `configure-vm.yml` | Configure VM | Disk, Cloud-Init, TPM, GPU all in one |
| `create-template.yml` | Make template | Skip if already templated |
| `create-clones.yml` | Deploy clones | Loop through clone list with validation |
| `helpers.yml` | Utilities | Reusable helper functions |
### Running Specific Stages
```bash
# Run only pre-flight checks
ansible-playbook tasks/main.yml --tags preflight
# Run everything except template/clone
ansible-playbook tasks/main.yml --skip-tags template,clones
# Run only clone creation
ansible-playbook tasks/main.yml --tags clones
# Run image download and VM creation only
ansible-playbook tasks/main.yml --tags image,vm
```
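The tag-based commands above rely on `main.yml` attaching one tag per stage. A minimal sketch of how such an orchestrator might look, assuming static `import_tasks` (so tags are inherited by the imported tasks) and assuming the tag name `config` for the configuration stage, which this guide does not confirm:

```yaml
# tasks/main.yml (illustrative sketch, not the actual file)
- name: "[PREFLIGHT] Validate environment"
  import_tasks: preflight-checks.yml
  tags: [preflight]

- name: "[IMAGE] Download image"
  import_tasks: download-image.yml
  tags: [image]

- name: "[VM] Create base VM"
  import_tasks: create-vm.yml
  tags: [vm]

- name: "[CONFIG] Configure VM"
  import_tasks: configure-vm.yml
  tags: [config]   # assumed tag name

- name: "[TEMPLATE] Convert to template"
  import_tasks: create-template.yml
  tags: [template]

- name: "[CLONES] Create clones"
  import_tasks: create-clones.yml
  tags: [clones]
```

With dynamic `include_tasks`, tags on the include statement are not passed to the included tasks unless `apply` is used, which is why the sketch uses `import_tasks`.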
---
## 2. ERROR HANDLING
### Before
- Minimal error checking
- Tasks would fail silently or with generic errors
- No recovery paths
### After
Each major operation has:
**Block/Rescue Structure**
```yaml
block:
  - name: "[CONFIG] Try to import disk"
    command: qm importdisk ...
rescue:
  - name: "[CONFIG] Handle import failure"
    fail:
      msg: "Clear error message with context"
```
**Retry Logic**
```yaml
register: result
retries: 3
delay: 5
until: result is succeeded
```
**Validation Checks**
```yaml
- name: "[VM] Verify VM was created"
  stat:
    path: "/etc/pve/qemu-server/{{ vm_id }}.conf"
  register: vm_verify
  failed_when: not vm_verify.stat.exists
```
### Error Messages Include
- What went wrong
- Which VM/resource was affected
- Next steps to fix
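As an illustration of a message that covers all three points, here is a hedged sketch of a `rescue` section for the disk import. The variable names `image_path` and `storage` are assumptions for this example and may differ from the role's actual variables:

```yaml
- name: "[CONFIG] Import disk with a descriptive failure"
  block:
    - name: "[CONFIG] Import disk"
      # image_path and storage are hypothetical variable names
      command: "qm importdisk {{ vm_id }} {{ image_path }} {{ storage }}"
  rescue:
    - name: "[CONFIG] Report what failed and how to fix it"
      fail:
        msg: >-
          Disk import failed for VM {{ vm_id }} on storage '{{ storage }}'.
          Check that {{ image_path }} exists and that the storage has free
          space (pvesm status), then re-run the playbook.
```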
---
## 3. IDEMPOTENCY
### Before
- Running playbook twice would fail or cause issues
- Template conversion would fail if already templated
- No checks for existing resources
### After
All operations are idempotent:
**Check Before Action**
```yaml
- name: "Check if VM already exists"
  stat:
    path: "/etc/pve/qemu-server/{{ vm_id }}.conf"
  register: vm_conf
- name: "Create VM"
  command: qm create ...
  when: not vm_conf.stat.exists
```
**Safe Re-runs**
- Already-created VMs are skipped
- Already-converted templates are skipped
- Already-deployed clones are skipped
- Image is cached and reused
**Result**: Re-running the playbook any number of times leaves existing VMs, templates, and clones unchanged.
---
## 4. PRE-FLIGHT CHECKS
### New `preflight-checks.yml`
Validates before starting:
✓ Proxmox is installed (`qm` command exists)
✓ User can run Proxmox commands (permissions)
✓ Storage pool exists and is accessible
✓ SSH key file exists and is readable
✓ VM IDs are unique (warns if conflict)
✓ Clone IDs are unique (warns if conflict)
✓ IP addresses are in a valid format
✓ Gateway and DNS are valid IPs
✓ Snippets directory exists
### Sample Output
```
[PREFLIGHT] Check if running on Proxmox host ... ok
[PREFLIGHT] Verify qm command is available ... ok
[PREFLIGHT] Check if user can run qm commands ... ok
[PREFLIGHT] Verify storage pool exists ... ok
[PREFLIGHT] Summary - All checks passed
```
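A few of these checks could be expressed roughly as follows. This is a sketch under assumptions, not the contents of `preflight-checks.yml`: the variable name `storage` is hypothetical, the IP test only checks the `a.b.c.d/prefix` shape, and the `assert` fails outright where the real file is described as only warning on ID conflicts.

```yaml
- name: "[PREFLIGHT] Verify qm command is available"
  command: which qm
  changed_when: false

- name: "[PREFLIGHT] Verify storage pool exists"
  # storage is a hypothetical variable name for the target storage pool
  command: "pvesm status --storage {{ storage }}"
  changed_when: false

- name: "[PREFLIGHT] Verify clone IDs are unique and IPs are well-formed"
  assert:
    that:
      # every clone ID appears exactly once
      - (clones | map(attribute='id') | list | unique | length) == (clones | length)
      # every clone IP matches the a.b.c.d/prefix shape (format check only)
      - (clones | map(attribute='ip') | select('match', '^([0-9]{1,3}[.]){3}[0-9]{1,3}/[0-9]{1,2}$') | list | length) == (clones | length)
    fail_msg: "Clone list has duplicate IDs or malformed IP addresses"
```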
---
## 5. IMPROVED DEFAULTS
### New Variables in `defaults/main.yml`
```yaml
# Retry settings
max_retries: 3
retry_delay: 5
# Timeout settings (seconds)
image_download_timeout: 300
vm_boot_timeout: 60
cloud_init_timeout: 120
# Debug mode
debug_mode: false
```
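To show how these settings might be consumed, here is a hedged sketch of a download task. `image_url` and `image_dest` are hypothetical variable names; the retry and timeout variables are the ones listed above.

```yaml
- name: "[IMAGE] Download image with retries"
  get_url:
    url: "{{ image_url }}"      # hypothetical variable
    dest: "{{ image_dest }}"    # hypothetical variable; a full file path
    timeout: "{{ image_download_timeout }}"
    mode: "0644"
  register: image_download
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  until: image_download is succeeded

- name: "[IMAGE] Show download result in debug mode"
  debug:
    var: image_download
  when: debug_mode | bool
```

When `dest` is a file path and `force` is left at its default, `get_url` does not download again if the file already exists, which gives a simple form of caching.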
### Better Documentation
Each variable has:
- Purpose explanation
- Valid values
- Examples
- Security warnings
---
## 6. IDEMPOTENT TEMPLATE CONVERSION
### Before
```yaml
- name: Convert VM to template
  command: qm template {{ vm_id }}
  args:
    creates: "/etc/pve/qemu-server/{{ vm_id }}.conf.lock"
```
The `.conf.lock` file never exists, so `creates` never matches and the task runs every time.
### After
```yaml
- name: "[TEMPLATE] Check if VM is already a template"
  shell: "qm config {{ vm_id }} | grep -q 'template: 1'"
  register: is_template
  changed_when: false  # read-only check; never report a change
  failed_when: false   # a non-zero rc only means "not a template yet"
- name: "[TEMPLATE] Convert VM to template"
  command: "qm template {{ vm_id }}"
  when: is_template.rc != 0
```
✅ Checks actual template status; skips if already templated
---
## 7. BETTER CLOUD-INIT HANDLING
### Before
- Snippets not validated
- SSH key lookup could fail silently
### After
```yaml
- name: "[CONFIG] Verify SSH key is readable"
  stat:
    path: "{{ ssh_key_path | expanduser }}"
  register: ssh_key_stat
  # stat.readable is only defined when the file exists, so test existence first
  failed_when: not ssh_key_stat.stat.exists or not ssh_key_stat.stat.readable
- name: "[CONFIG] Copy SSH public key to snippets"
  copy:
    src: "{{ ssh_key_path | expanduser }}"
    dest: "/var/lib/vz/snippets/{{ vm_id }}-sshkey.pub"
```
✓ Validates before use
✓ Proper error messages if missing
---
## 8. HELPER FUNCTIONS
### New `helpers.yml`
Reusable utility tasks:
| Helper | Function |
|--------|----------|
| `check_vm_exists` | Check if VM exists |
| `check_template` | Check if VM is template |
| `check_vm_status` | Get VM running status |
| `check_storage` | Check storage space |
| `validate_vm_id` | Validate VM ID format |
| `get_vm_info` | Read VM configuration |
| `list_vms` | List all VMs |
| `cleanup_snippets` | Remove old Cloud-Init snippets |
### Usage Example
```yaml
- name: "Verify VM exists"
  include_tasks: helpers.yml
  vars:
    helper_task: check_vm_exists
    target_vm_id: "{{ vm_id }}"
- name: "Print result"
  debug:
    msg: "VM exists: {{ vm_exists }}"
```
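For the call above to work, `helpers.yml` has to dispatch on `helper_task` and publish a `vm_exists` fact. One way this could be structured is sketched below; the register names (`helper_vm_conf`, `helper_template`) are assumptions, and the actual file may be organized differently.

```yaml
# helpers.yml (illustrative sketch): each helper runs only when selected
- name: "[HELPER] check_vm_exists: look for the VM config file"
  stat:
    path: "/etc/pve/qemu-server/{{ target_vm_id }}.conf"
  register: helper_vm_conf
  when: helper_task == 'check_vm_exists'

- name: "[HELPER] check_vm_exists: expose the result as vm_exists"
  set_fact:
    vm_exists: "{{ helper_vm_conf.stat.exists }}"
  when: helper_task == 'check_vm_exists'

- name: "[HELPER] check_template: test the template flag"
  shell: "qm config {{ target_vm_id }} | grep -q 'template: 1'"
  register: helper_template
  changed_when: false
  failed_when: false
  when: helper_task == 'check_template'
```

Facts set with `set_fact` remain available to the calling tasks after the include returns, which is what lets the caller read `vm_exists`.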
---
## 9. IMPROVED CLONE CREATION
### Before
- No validation of clone IDs
- No error handling per clone
- All-or-nothing approach
### After
```yaml
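# Ansible does not allow `loop` on a `block`, so loop over an included
# file instead. The file name clone-single.yml is illustrative.
- name: "[CLONES] Process each clone"
  include_tasks: clone-single.yml
  loop: "{{ clones }}"
  loop_control:
    loop_var: clone

# clone-single.yml: runs once per clone
- block:
    - name: "[CLONES] Check if clone already exists"
      stat:
        path: "/etc/pve/qemu-server/{{ clone.id }}.conf"
      register: clone_conf
    - name: "[CLONES] Clone VM"
      command: qm clone {{ vm_id }} {{ clone.id }}
      when: not clone_conf.stat.exists
  rescue:
    - name: "[CLONES] Handle error for this clone"
      debug:
        msg: "WARNING: Clone {{ clone.id }} failed, continuing with next..."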
```
✓ Each clone is independent
✓ One failed clone doesn't stop others
✓ Clear logging of what succeeded/failed
---
## 10. RICH LOGGING AND PROGRESS
### Task Naming Convention
```
[STAGE] Action: description
├─ [PREFLIGHT] Check if running on Proxmox
├─ [IMAGE] Download Debian GenericCloud
├─ [VM] Create base VM
├─ [CONFIG] Configure disk
├─ [TEMPLATE] Convert to template
└─ [CLONES] Create clone 301
```
### Progress Display
**Start**
```
╔════════════════════════════════════════════════════════════╗
║ Proxmox VM Template & Clone Manager ║
║ Template VM: debian-template-base (ID: 150) ║
║ Storage: local-lvm ║
║ CPU: 4 cores | RAM: 4096MB ║
╚════════════════════════════════════════════════════════════╝
```
**End**
```
╔════════════════════════════════════════════════════════════╗
║ ✓ Playbook execution completed ║
║ Template VM: debian-template-base (ID: 150) ║
║ ✓ Converted to template ║
║ ✓ 2 clone(s) created ║
║ Next steps: ║
║ - Verify VMs: qm list ║
║ - Connect: ssh debian@<vm-ip> ║
║ - Check Cloud-Init: cloud-init status ║
╚════════════════════════════════════════════════════════════╝
```
---
## Usage Examples
### 1. Full Deployment
```bash
ansible-playbook tasks/main.yml -i inventory
```
Runs all stages: preflight → image → VM → configure → template → clones
### 2. Re-run Safely (Idempotent)
```bash
ansible-playbook tasks/main.yml -i inventory
```
Second run skips already-completed operations.
### 3. Template Only
To update the template without re-downloading the image:
```bash
ansible-playbook tasks/main.yml \
-i inventory \
--skip-tags image,vm,clones
```
### 4. Clone Only
After the template is created, add new clones:
```yaml
# Update defaults/main.yml
clones:
  - id: 303
    hostname: app03
    ip: "192.168.1.83/24"
    gateway: "192.168.1.1"
```
Then run:
```bash
ansible-playbook tasks/main.yml \
-i inventory \
--tags clones
```
### 5. Debug Output
```bash
ansible-playbook tasks/main.yml \
-i inventory \
-vvv
```
Shows all task details, command output, and variable values.
---
## Migration from Old Version
### Step 1: Backup
```bash
cp -r ansible_proxmox_VM ansible_proxmox_VM.backup
```
### Step 2: Replace Files
Use the new versions:
- `tasks/main.yml` → orchestrator
- All `tasks/*.yml` files → new implementations
- `defaults/main.yml` → improved defaults
### Step 3: Test with Dry-Run
```bash
ansible-playbook tasks/main.yml \
-i inventory \
--check
```
Shows what would happen without making changes.
### Step 4: Run Normally
```bash
ansible-playbook tasks/main.yml -i inventory
```
---
## Best Practices Going Forward
1. **Always use tags** for partial execution
2. **Run preflight checks** before major changes
3. **Test with `--check`** before production
4. **Use `--skip-tags`** to avoid re-downloading images
5. **Monitor Cloud-Init** inside VMs: `cloud-init status`
6. **Keep the `.orig` backup files** (already present)
7. **Review error messages** carefully for context
---
## Security Improvements
### Password Management
```yaml
# OLD
ci_password: "SecurePass123"
# NEW - Use Vault
ci_password: "{{ vault_debian_password }}"
```
Create vault file:
```bash
ansible-vault create group_vars/proxmox/vault.yml
```
Add:
```yaml
vault_debian_password: "YourSecurePassword"
```
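Keeping the password in Vault does not stop it from appearing in task output. A hedged sketch of a task that applies the password with logging suppressed, assuming the role sets it through `qm set --cipassword` (the actual task may differ):

```yaml
- name: "[CONFIG] Set Cloud-Init password without logging it"
  # quote protects passwords that contain spaces or shell metacharacters
  command: "qm set {{ vm_id }} --cipassword {{ ci_password | quote }}"
  no_log: true   # hides the command line and result from Ansible output
```

Run the playbook with `--ask-vault-pass` (or a vault password file) so that `vault_debian_password` can be decrypted.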
### SSH Key Validation
Before: SSH key could be missing → confusing error
After: Validates key exists and is readable
---
## Troubleshooting
### Problem: Playbook fails at preflight
**Solution**: Run preflight checks manually to see what's missing
```bash
ansible-playbook tasks/main.yml -i inventory --tags preflight -vvv
```
### Problem: VM already exists, need to recreate
**Solution**: Delete the old VM first
```bash
qm destroy <vm-id>   # replace <vm-id> with the VM's numeric ID
```
Then re-run the playbook; it recreates the VM because the ID no longer exists.
### Problem: Clone creation fails
**Solution**: Check clone configuration and IDs
```bash
qm list # See all VMs
```
Ensure clone IDs don't conflict with existing VMs.
### Problem: Cloud-Init not applying
**Solution**: Check snippets directory exists
```bash
ls -la /var/lib/vz/snippets/
```
Verify permissions are correct (644 for YAML files).
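A sketch of tasks that would guarantee the directory and file permissions mentioned above. The destination file name is an assumption for illustration; `cloudinit_userdata.yaml.j2` is the template listed under `templates/`.

```yaml
- name: "[CONFIG] Ensure snippets directory exists"
  file:
    path: /var/lib/vz/snippets
    state: directory
    mode: "0755"

- name: "[CONFIG] Render Cloud-Init user data snippet"
  template:
    src: cloudinit_userdata.yaml.j2
    # hypothetical destination file name
    dest: "/var/lib/vz/snippets/{{ vm_id }}-userdata.yaml"
    mode: "0644"
```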
---
## Next Steps
Consider these additional improvements:
1. **Molecule Testing** - Add automated tests
2. **Vault Integration** - Secure password management
3. **Role Packaging** - Create Ansible Galaxy package
4. **Custom Filters** - For more complex logic
5. **Notification** - Send completion alerts (Slack, email)
6. **Metrics** - Track VM creation time, resource usage
7. **Cleanup Role** - Destroy VMs and templates
8. **Backup/Restore** - Template and clone backup
---
## Questions?
Refer to task inline comments for specifics. Each task file has extensive documentation.