Applies to SUSE Cloud Application Platform 1.5.2

31 Troubleshooting

Cloud stacks are complex, and debugging deployment issues often requires digging through multiple layers to find the information you need. Remember that the SUSE Cloud Foundry releases must be deployed in the correct order, and that each release must deploy successfully, with no failed pods, before deploying the next release.
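
For example, you can watch the pods in a namespace and wait until all of them report a fully ready state before deploying the next release (this example assumes the uaa namespace; adjust as needed):

tux > watch --color 'kubectl get pods --namespace uaa'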

31.1 Logging

There are two types of logs in a deployment of SUSE Cloud Application Platform, application logs and component logs. The following provides a brief overview of each log type and how to retrieve them for monitoring and debugging use.

  • Application logs provide information specific to a given application that has been deployed to your Cloud Application Platform cluster and can be accessed through:

    • The cf CLI using the cf logs command (see the example after this list)

    • The application's log stream within the Stratos console

  • Access to logs for a given component of your Cloud Application Platform deployment can be obtained by:

    • The kubectl logs command

      The following example retrieves the logs of the router container of the router-0 pod in the scf namespace:

      tux > kubectl logs --namespace scf router-0 router
    • Direct access to the log files using the following:

      1. Open a shell to the container of the component using the kubectl exec command.

      2. Navigate to the log directory at /var/vcap/sys/log, where subdirectories contain the log files.

        tux > kubectl exec --stdin --tty --namespace scf router-0 -- /bin/bash
        
        router/0:/# cd /var/vcap/sys/log
        
        router/0:/var/vcap/sys/log# ls -R
        .:
        gorouter  loggregator_agent
        
        ./gorouter:
        access.log  gorouter.err.log  gorouter.log  post-start.err.log  post-start.log
        
        ./loggregator_agent:
        agent.log
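
As an example of the cf CLI option above, the following retrieves recent log output for a deployed application (MY-APP is a placeholder for one of your application names):

tux > cf logs MY-APP --recent

Omit --recent to stream the application's log output continuously instead.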

31.2 Using Supportconfig

If you ever need to request support, or just want to generate detailed system information and logs, use the supportconfig utility. Run it with no options to collect basic system information, as well as cluster logs including Docker, etcd, flannel, and Velum. supportconfig may give you all the information you need.

supportconfig -h prints the options. Read the "Gathering System Information for Support" chapter in any SUSE Linux Enterprise Administration Guide to learn more.
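
For example, run the utility as root with no options to generate a full report; the resulting archive is typically written to /var/log:

tux > sudo supportconfig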

31.3 Deployment Is Taking Too Long

A deployment step seems to take too long, or you see that some pods are not in a ready state hours after all the others are ready, or a pod shows a lot of restarts. This example shows not-ready pods many hours after the others have become ready:

tux > kubectl get pods --namespace scf
NAME                     READY STATUS    RESTARTS  AGE
router-3137013061-wlhxb  0/1   Running   0         16h
routing-api-0            0/1   Running   0         16h

The Running status means the pod is bound to a node and all of its containers have been created. However, it is not Ready, which means it is not ready to service requests. Use kubectl to print a detailed description of pod events and status:

tux > kubectl describe pod --namespace scf router-3137013061-wlhxb

This prints a lot of information, including IP addresses, routine events, warnings, and errors. You should find the reason for the failure in this output.
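
If the pod description is not conclusive, listing recent events in the namespace, sorted by timestamp, can also help locate the failure, for example:

tux > kubectl get events --namespace scf --sort-by='.lastTimestamp'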

Important
Some Pods Show as Not Running

Some uaa and scf pods perform only deployment tasks, and it is normal for them to show as unready and Completed after they have completed their tasks, as these examples show:

tux > kubectl get pods --namespace uaa
secret-generation-1-z4nlz   0/1       Completed

tux > kubectl get pods --namespace scf
secret-generation-1-m6k2h       0/1       Completed
post-deployment-setup-1-hnpln   0/1       Completed
Some Pods Terminate and Restart during Deployment

When monitoring the status of a deployment, you may observe pods transitioning from a Running state to a Terminating state, then returning to a Running state again.

If a RESTARTS count of 0 is maintained during this process, this is normal behavior and not due to failing pods, and it is not necessary to stop the deployment. During deployment, pods modify annotations on themselves via the StatefulSet pod spec. In order to get the correct annotations on a running pod, it is stopped and restarted. Under normal circumstances, this behavior should only result in a pod restarting once.
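
To observe these transitions and confirm that the RESTARTS count stays at 0, you can watch the pods during deployment:

tux > kubectl get pods --namespace scf --watch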

31.4 Deleting and Rebuilding a Deployment

There may be times when you want to delete and rebuild a deployment, for example when there are errors in your scf-config-values.yaml file, you wish to test configuration changes, or a deployment fails and you want to try again. This has five steps: first delete the StatefulSets of the namespace associated with the release or releases you want to re-deploy, then delete the release or releases, delete the namespace, then re-create the namespace and re-deploy the release.

The namespace is also deleted as part of the process because the SCF and UAA namespaces contain generated secrets which Helm is not aware of and will not remove when a release is deleted. When deleting a release, busy systems may encounter timeouts. Deleting the StatefulSets first makes this operation more likely to succeed. Using the delete statefulsets command requires kubectl v1.9.6 or newer.
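
To check whether your kubectl client meets this requirement, print its version:

tux > kubectl version --short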

Use helm to see your releases:

tux > helm list
NAME            REVISION  UPDATED                  STATUS    CHART           NAMESPACE
susecf-console  1         Tue Aug 14 11:53:28 2018 DEPLOYED  console-3.1.0   stratos
susecf-scf      1         Tue Aug 14 10:58:16 2018 DEPLOYED  cf-2.20.3       scf
susecf-uaa      1         Tue Aug 14 10:49:30 2018 DEPLOYED  uaa-2.20.3      uaa

This example deletes the susecf-uaa release and uaa namespace:

tux > kubectl delete statefulsets --all --namespace uaa
statefulset "mysql" deleted
statefulset "uaa" deleted

tux > helm delete --purge susecf-uaa
release "susecf-uaa" deleted

tux > kubectl delete namespace uaa
namespace "uaa" deleted

Repeat the same process for the susecf-scf release and scf namespace:

tux > kubectl delete statefulsets --all --namespace scf
statefulset "adapter" deleted
statefulset "api-group" deleted
...

tux > helm delete --purge susecf-scf
release "susecf-scf" deleted

tux > kubectl delete namespace scf
namespace "scf" deleted

Then you can start over. Be sure to create new release and namespace names.
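
The re-creation steps mirror the initial deployment. The following is a minimal sketch for uaa, assuming the suse/uaa chart from the SUSE Helm chart repository and the Helm 2 syntax used elsewhere in this chapter; substitute your own namespace name, release name, and scf-config-values.yaml file:

tux > kubectl create namespace UAA_NAMESPACE

tux > helm install suse/uaa \
--name UAA_RELEASE_NAME \
--namespace UAA_NAMESPACE \
--values scf-config-values.yaml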

31.5 Querying with Kubectl

You can safely query with kubectl to get information about resources inside your Kubernetes cluster. kubectl cluster-info dump | tee clusterinfo.txt outputs a large amount of information about the Kubernetes master and cluster services to a text file.

The following commands give more targeted information about your cluster.

  • List all cluster resources:

    tux > kubectl get all --all-namespaces
  • List all of your running pods:

    tux > kubectl get pods --all-namespaces
  • List all of your running pods, their internal IP addresses, and which Kubernetes nodes they are running on:

    tux > kubectl get pods --all-namespaces --output wide
  • See all pods, including those with Completed or Failed statuses:

    tux > kubectl get pods --show-all --all-namespaces
  • List pods in one namespace:

    tux > kubectl get pods --namespace scf
  • Get detailed information about one pod:

    tux > kubectl describe --namespace scf po/diego-cell-0
  • Read the log file of a pod:

    tux > kubectl logs --namespace scf po/diego-cell-0
  • List all Kubernetes nodes, then print detailed information about a single node:

    tux > kubectl get nodes
    tux > kubectl describe node 6a2752b6fab54bb889029f60de6fa4d5.infra.caasp.local
  • List all containers in all namespaces, formatted for readability:

    tux > kubectl get pods --all-namespaces --output jsonpath="{..image}" |\
    tr -s '[[:space:]]' '\n' |\
    sort |\
    uniq -c
  • These two commands check node capacities, to verify that there are enough resources for the pods:

    tux > kubectl get nodes --output yaml | grep '\sname\|cpu\|memory'
    tux > kubectl get nodes --output json | \
    jq '.items[] | {name: .metadata.name, cap: .status.capacity}'

31.6 autoscaler-metrics Pod Liquibase Error

In some situations, the autoscaler-metrics pod may fail to reach a fully ready state due to a Liquibase error, liquibase.exception.LockException: Could not acquire change log lock.

When this occurs, use the following procedure to resolve the issue.

  1. Verify whether the error has occurred by examining the logs.

    tux > kubectl logs autoscaler-metrics-0 autoscaler-metrics --follow --namespace scf

    If the error has occurred, the log will contain entries similar to the following.

    # Starting Liquibase at Wed, 21 Aug 2019 22:38:00 UTC (version 3.6.3 built at 2019-01-29 11:34:48)
    # Unexpected error running Liquibase: Could not acquire change log lock.  Currently locked by autoscaler-metrics-0.autoscaler-metrics-set.cf.svc.cluster.local (172.16.0.173) since 8/21/19, 9:20 PM
    # liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by autoscaler-metrics-0.autoscaler-metrics-set.cf.svc.cluster.local (172.16.0.173) since 8/21/19, 9:20 PM
    #         at liquibase.lockservice.StandardLockService.waitForLock(StandardLockService.java:230)
    #         at liquibase.Liquibase.update(Liquibase.java:184)
    #         at liquibase.Liquibase.update(Liquibase.java:179)
    #         at liquibase.integration.commandline.Main.doMigration(Main.java:1220)
    #         at liquibase.integration.commandline.Main.run(Main.java:199)
    #         at liquibase.integration.commandline.Main.main(Main.java:137)
  2. After confirming the error has occurred, use the provided Ruby script to recover from the error.

    1. Save the script below to a file. In this example, it is saved to a file named recovery-autoscaler-metrics-liquibase-stale-lock.

      #!/usr/bin/env ruby
      
      # This script recovers from autoscaler-metrics failing to start because it could
      # not continue schema migrations.  The error looks something like:
      
      # Starting Liquibase at Wed, 21 Aug 2019 22:38:00 UTC (version 3.6.3 built at 2019-01-29 11:34:48)
      # Unexpected error running Liquibase: Could not acquire change log lock.  Currently locked by autoscaler-metrics-0.autoscaler-metrics-set.cf.svc.cluster.local (172.16.0.173) since 8/21/19, 9:20 PM
      # liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by autoscaler-metrics-0.autoscaler-metrics-set.cf.svc.cluster.local (172.16.0.173) since 8/21/19, 9:20 PM
      #         at liquibase.lockservice.StandardLockService.waitForLock(StandardLockService.java:230)
      #         at liquibase.Liquibase.update(Liquibase.java:184)
      #         at liquibase.Liquibase.update(Liquibase.java:179)
      #         at liquibase.integration.commandline.Main.doMigration(Main.java:1220)
      #         at liquibase.integration.commandline.Main.run(Main.java:199)
      #         at liquibase.integration.commandline.Main.main(Main.java:137)
      
      require 'json'
      require 'open3'
      
      EXPECTED_ERROR='Could not acquire change log lock.  Currently locked by'
      
      def capture(*args)
          stdout, status = Open3.capture2(*args)
          return stdout if status.success?
          cmd = args.dup
          cmd.shift while Hash.try_convert(cmd.first)
          cmd.pop while Hash.try_convert(cmd.last)
          fail %Q<"#{cmd.join(' ')}" failed with exit status #{status.exitstatus}>
      end
      
      def namespace
          $namespace ||= JSON.parse(capture('helm list --output json'))['Releases']
              .select { |r| r['Status'].downcase == 'deployed' }
              .select { |r| r['Chart'].start_with? 'cf-' }
              .first['Namespace']
      end
      
      def run(*args)
          Process.wait Process.spawn(*args)
          return if $?.success?
          cmd = args.dup
          cmd.shift while Hash.try_convert(cmd.first)
          cmd.pop while Hash.try_convert(cmd.last)
          fail %Q<"#{cmd.join(' ')}" failed with exit status #{$?.exitstatus}>
      end
      
      postgres_pod = JSON.load(capture('kubectl', 'get', 'pods',
          '--namespace', namespace,
          '--selector', 'skiff-role-name==autoscaler-postgres',
          '--output', 'json'))['items']
          .find { |pod|
              pod['status']['conditions']
                  .find {|status| status['type'] == 'Ready'}['status'] == 'True'
          }['metadata']['name']
      
      psql_binary = capture('kubectl', 'exec', postgres_pod,
          '--namespace', namespace,
          '--', 'sh', '-c', 'echo /var/vcap/packages/postgres-*/bin/psql').split.first
      
      run 'kubectl', 'exec', postgres_pod,
          '--namespace', namespace,
          '--',
          psql_binary,
          '--username', 'postgres',
          '--dbname', 'autoscaler',
          '--command', 'SELECT * FROM databasechangeloglock'
      
      locked_bys = (1..Float::INFINITY).each do |count|
          STDOUT.printf "\rWaiting for pod to get stuck (try %s)...", count
          candidate_locked_bys = capture('kubectl', 'get', 'pods',
                              '--namespace', namespace,
                              '--selector', 'skiff-role-name==autoscaler-metrics',
                              '--output', 'jsonpath={.items[*].metadata.name}')
              .split
              .map do |pod_name|
                  capture('kubectl', 'logs', pod_name,
                          '--namespace', namespace,
                          '--container', 'autoscaler-metrics')
              end
              .select { |message| message.include? EXPECTED_ERROR }
              .map { |message| /Currently locked by (.*?) since /.match(message)[1]}
      
          break candidate_locked_bys unless candidate_locked_bys.empty?
          sleep 1
      end
      puts '' # new line after the wait
      locked_bys.each do |locked_by|
          puts "Releasing lock for #{locked_by}"
          run 'kubectl', 'exec', postgres_pod,
              '--namespace', namespace,
              '--',
              psql_binary,
              '--username', 'postgres',
              '--dbname', 'autoscaler',
              '--command', %Q<UPDATE databasechangeloglock SET locked = FALSE WHERE lockedby = '#{locked_by}'>
      end
    2. Run the script.

      tux > ruby recovery-autoscaler-metrics-liquibase-stale-lock

      All autoscaler pods should now enter a fully ready state.
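
To confirm, list the autoscaler metrics pods using the same label selector the script relies on and verify that they report READY 1/1:

tux > kubectl get pods --namespace scf --selector skiff-role-name=autoscaler-metrics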

31.7 api-group Pod Not Ready after Buildpack Update

When a built-in buildpack is updated as described in one of the scenarios below, the api-group pod will fail to reach a fully ready state upon restart. A built-in buildpack is a buildpack that is included as part of a SUSE Cloud Application Platform release and has no stack association by default.

The following scenarios can result in the api-group not being fully ready:

  • A built-in buildpack is assigned a stack association. For example:

    tux > cf update-buildpack ruby_buildpack --assign-stack sle15
  • A built-in buildpack is updated to a version with an inherent stack association, such as those provided by upstream Cloud Foundry. For example:

    tux > cf update-buildpack ruby_buildpack -p https://github.com/cloudfoundry/ruby-buildpack/releases/download/v1.7.43/ruby-buildpack-cflinuxfs3-v1.7.43.zip
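
To review the buildpacks in your deployment and check whether any of them have acquired a stack association, list them with the cf CLI (the stack column is shown by cf CLI versions that support stack associations):

tux > cf buildpacks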

31.7.1 Recovery and Prevention

In order to bring the api-group pod back to a fully ready state, use the following procedure.

  1. Delete the buildpack that was updated to have a stack association.

    tux > cf delete-buildpack ruby_buildpack
  2. Restart the api-group pod.

    tux > kubectl delete pod api-group-0 --namespace scf
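
The StatefulSet re-creates the deleted pod. To wait for it to report a fully ready state, you can use kubectl wait (available with kubectl 1.11 and newer; the timeout value is only an example):

tux > kubectl wait --namespace scf --for=condition=ready pod/api-group-0 --timeout=600s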

To use a buildpack with a stack association without encountering the issue described in this section, do not update one of the built-in buildpacks to assign a stack or to a version with an inherent stack association. Instead, create a new buildpack with the desired buildpack files and associated stack.

tux > cf create-buildpack ruby_cflinuxfs3 https://github.com/cloudfoundry/ruby-buildpack/releases/download/v1.7.41/ruby-buildpack-cflinuxfs3-v1.7.41.zip 1 --enable

31.8 mysql Pods Fail to Start

mysql pods may fail to start for a variety of reasons. This can affect the mysql pod in both the uaa and scf namespaces.

31.8.1 Data initialization or incomplete transactions

  1. Confirm whether the mysql pod failure is related to data initialization problems or incomplete transactions.

    1. Perform an initial inspection of the mysql container's logs. A pre-start related error should be observed. If the mysql pod from scf is experiencing issues, substitute scf as the namespace.

      tux > kubectl logs --namespace uaa mysql-0
    2. Examine the /var/vcap/sys/log/pxc-mysql/mysql.err.log file from the mysql container filesystem. If the mysql pod from scf is experiencing issues, substitute scf as the namespace.

      tux > kubectl exec --stdin --tty --namespace uaa mysql-0 -- bash -c 'cat /var/vcap/sys/log/pxc-mysql/mysql.err.log'

      One of the following error messages should be found:

      • [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
        [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.user' doesn't exist
      • [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
        [ERROR] Aborting

      This can happen when mysql fails to initialize and read its data directory, possibly due to resource limitations in the cluster.

  2. If the mysql pod failure is confirmed to be due to one of the errors above, perform the following procedure to return it to a running state.

    When using the procedure, it is important to note that:

    • In a high availability setup, replication will synchronize data to the new, clean persistent volume claim (PVC).

    • In a single availability setup, back up your data before applying the procedure, as it will result in data loss. It is recommended practice to regularly back up your data.

    1. Delete the persistent volume claim (PVC) attached to the failed mysql pod. Identify the name of the PVC by using kubectl describe pod and inspecting the ClaimName field. If the mysql pod from scf is experiencing issues, substitute scf as the namespace.

      tux > kubectl describe pod --namespace uaa mysql-0
      
      tux > kubectl delete persistentvolumeclaim --namespace uaa mysql-data-mysql-0
    2. Immediately delete the stuck mysql pod. It is important to get the order right; otherwise, the PVC deletion will never terminate because of the attached pod. If the mysql pod from scf is experiencing issues, substitute scf as the namespace.

      tux > kubectl delete pod --namespace uaa mysql-0

31.8.2 File ownership issues

If the mysql pod is in state READY: 0/1 and STATUS: Running, like this:

tux > kubectl get pods --namespace uaa
NAME                          READY STATUS    RESTARTS AGE
configgin-helper-0            1/1   Running   0        16m
mysql-0                       0/1   Running   0        16m
mysql-proxy-0                 1/1   Running   0        16m
post-deployment-setup-1-lg7vl 0/1   Completed 0        16m
secret-generation-1-fczw4     0/1   Completed 0        16m
uaa-0                         0/1   Running   0        16m

Verify if all files in the /var/vcap/store directory of the mysql-0 pod are owned by the vcap user:

tux > kubectl exec --stdin --tty --namespace uaa mysql-0 -- bash
mysql/0:/# ls -Al /var/vcap/store/
total 12
drwx------ 2 vcap nobody 4096 Mar 9 07:41 mysql_audit_logs
drwx------ 6 vcap nobody 4096 Mar 9 07:42 pxc-mysql

If not, please verify the configuration of your NFS share.
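
To identify the PersistentVolume that needs checking, look up the volume bound to the mysql data claim and inspect it (PV_NAME is a placeholder for the returned volume name; substitute scf as the namespace if appropriate):

tux > kubectl get persistentvolumeclaim --namespace uaa mysql-data-mysql-0 \
--output jsonpath='{.spec.volumeName}'

tux > kubectl describe persistentvolume PV_NAME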
