Troubleshooting and verification steps
The first step when troubleshooting an Elemental issue is to identify the phase in which the issue appears. A regular Elemental life cycle goes through the following phases:
- Create a MachineRegistration resource
  - The user provides node installation and configuration parameters.
  - The Elemental operator generates a token-based registration URL.
- Create a SeedImage resource
  - Builds and serves an ISO or RAW image with the selected OS, including the registration URL of the given MachineRegistration.
- Registration and installation of nodes
  - A node booted from the ISO or RAW image of a SeedImage auto-registers, creating a MachineInventory.
  - Installation starts, and the node reboots into the installed system, applying the configuration given in the associated MachineRegistration.
- Creation of a new Elemental cluster
  - The new cluster uses the node selector criteria to adopt matching MachineInventories.
  - The Elemental operator adds a finalizer to the adopted MachineInventories to handle the reset use case.
- K8s provisioning
  - The Elemental operator triggers Rancher provisioning scripts with the elemental-system-agent service.
  - Rancher handles the rest of the Kubernetes provisioning at this point. The provisioning system installs the rancher-system-agent service on the nodes, which follows and executes the plans provided by the management cluster.
- Create a ManagedOSImage resource (OS upgrade)
  - Creates a System Upgrade Controller (SUC) plan that runs the OSImage as a pod in the downstream cluster, one node at a time, where the image dumps itself into a new snapshot.
- Kubernetes upgrade
  - Entirely managed by Rancher; there are no Elemental-specific procedures at this stage.
What to check in different phases
The following checks and validations help narrow down and scope the issue in each phase.
Issues building the installation media (SeedImage)
- Check the associated SeedImage resource status, then check the related pod and its logs (a pod named with the media-image-reg prefix).
- If the seedimage pod is not launched, the elemental-operator pod logs related to SeedImage resources will be helpful.
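As a rough illustration of the pod check above, the following sketch filters the JSON printed by `kubectl get pods -o json` for pods whose name starts with the media-image-reg prefix and reports their phase. The function name and the sample data are hypothetical. The namespace to query depends on where the SeedImage was created and is not assumed here.

```python
import json


def find_seedimage_pods(pod_list, prefix="media-image-reg"):
    """Return (name, phase) pairs for pods whose name starts with prefix.

    pod_list is the parsed JSON document printed by
    `kubectl get pods -o json` (a List object with an "items" array).
    """
    matches = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        if name.startswith(prefix):
            phase = pod.get("status", {}).get("phase", "Unknown")
            matches.append((name, phase))
    return matches


# Hypothetical sample data standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "media-image-reg-abc12"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "unrelated-pod"}, "status": {"phase": "Running"}}
]}
""")

print(find_seedimage_pods(sample))
```

An empty result suggests that the seedimage pod was never launched, in which case the elemental-operator pod logs are the next place to look.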
Issues creating the MachineInventory (image boot + register + OS install)
- The installer media does not register.
  - Check in the SeedImage whether the livecd-cloud-config.yaml is consistent with an active MachineRegistration in Rancher. Then check whether the node has access to the URL and, finally, check the logs of the elemental-register-install.service.
- The MachineInventory is created but never becomes active.
  - Check whether the elemental-register-install.service failed and, if so, check the service logs.
  - If the installation succeeded but there was no reboot, check whether the MachineRegistration has reboot set to true in the install section.
  - If the system rebooted but failed to finalize registration, check the elemental-register.service logs.
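The checks above form a small decision tree. The sketch below encodes it as a function so the order of the checks is explicit. The function name and its boolean parameters are hypothetical, and the returned strings only restate the checks listed in this section.

```python
def next_check(inventory_created, install_failed, rebooted):
    """Suggest what to inspect next, following the checks listed above.

    inventory_created: a MachineInventory exists for the node
    install_failed:    elemental-register-install.service reported a failure
    rebooted:          the node rebooted into the installed system
    """
    if not inventory_created:
        # The installer media did not register at all.
        return ("compare livecd-cloud-config.yaml with the MachineRegistration, "
                "verify the node can reach the registration URL, then read "
                "the elemental-register-install.service logs")
    if install_failed:
        return "read the elemental-register-install.service logs"
    if not rebooted:
        return ("check that reboot is set to true in the install section "
                "of the MachineRegistration")
    # Installed and rebooted, but the inventory is still not active.
    return "read the elemental-register.service logs"


print(next_check(inventory_created=True, install_failed=False, rebooted=False))
```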
Issues assigning machines to a cluster
Check that all values are consistent: the labels on the nodes against the selector criteria of the new cluster, and the number of nodes the cluster is defined for. If they are consistent, check for related errors in the elemental-operator logs (look at the traces for MachineInventory and MachineInventorySelector resources).
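To make the consistency check concrete, the sketch below evaluates a selector against the labels of a few inventories and compares the number of matches with the number of nodes requested. It assumes the selector follows the standard Kubernetes label selector shape (matchLabels and matchExpressions). The inventory names, labels and node count are hypothetical sample data.

```python
def selector_matches(labels, selector):
    """Evaluate a Kubernetes-style label selector against a label dict.

    All matchLabels entries and all matchExpressions must hold (logical AND).
    """
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            # A missing key also satisfies NotIn.
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical inventories (name -> labels) and cluster definition.
inventories = {
    "node-a": {"location": "europe", "role": "worker"},
    "node-b": {"location": "asia", "role": "worker"},
    "node-c": {"location": "europe"},
}
selector = {
    "matchLabels": {"role": "worker"},
    "matchExpressions": [
        {"key": "location", "operator": "In", "values": ["europe"]},
    ],
}
nodes_requested = 2

adoptable = [name for name, labels in inventories.items()
             if selector_matches(labels, selector)]
print(adoptable, len(adoptable) >= nodes_requested)
```

With this sample data only node-a satisfies both requirements, so a cluster defined for two nodes would be left waiting for another matching inventory.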
Issues provisioning Kubernetes
Elemental triggers Rancher provisioning via the elemental-system-agent. If the elemental-system-agent does not report errors, the root cause of the issue is likely related to the Rancher provisioning process.
Issues upgrading the node OS
Check whether the System Upgrade Controller (SUC) plan is created and launched in the downstream clusters. If so, check and provide the logs of the pod that the SUC launched in the downstream cluster (a pod named with the apply-os-upgrader prefix). Downgrades are not allowed by default, so check that both OS versions, the current one and the one to upgrade to, are acceptable.
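A quick way to rule out an accidental downgrade is to compare the two versions before creating the upgrade resource. The sketch below assumes plain dotted numeric versions with an optional leading "v". This is an assumption for illustration only: the comparison that the upgrade process itself performs may follow different rules.

```python
def parse_version(version):
    """Turn a string such as 'v2.1.0' into a tuple of integers (2, 1, 0)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_downgrade(current, target):
    """Return True when target is lower than current."""
    cur, tgt = parse_version(current), parse_version(target)
    # Pad the shorter tuple with zeros so that '2.1' equals '2.1.0'.
    width = max(len(cur), len(tgt))
    cur += (0,) * (width - len(cur))
    tgt += (0,) * (width - len(tgt))
    return tgt < cur


# Hypothetical version strings.
print(is_downgrade("v2.1.0", "v2.0.4"))
print(is_downgrade("v2.0.4", "v2.1.0"))
```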