Troubleshooting
Problem
Running Spark applications generates Spark logs that are stored in the underlying storage volume. The accumulation of log files can cause the Kubernetes cluster to delay container creation and display the error message "Context deadline exceeded". As a result, the Spark pod stalls in the CreateContainerError state.
Symptom
The Kubernetes cluster delays the creation of containers and displays the error message "Context deadline exceeded". The Spark pod stalls in the CreateContainerError state.
Resolving The Problem
To resolve the issue, you can clear the Spark logs manually or use cron jobs to clear them automatically.
Procedure:
- Run the following command to get the spark-hb-nginx image. Use its output to replace the <spark-hb-nginx-image-here> and <spark-nginx-image-here> placeholders in the manifests that follow.
oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
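If you prefer to script the substitution rather than edit the manifests by hand, the following sketch captures the image reference in a shell variable and injects it into a saved manifest. The variable name SPARK_NGINX_IMAGE, the file name cleanup-cronjob.yaml, and the use of sed are illustrative assumptions, not part of the documented procedure.
# Capture the image reference reported by the deployment (illustrative variable name).
SPARK_NGINX_IMAGE=$(oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}')
# Substitute the placeholder in a saved CronJob manifest and apply it (assumed file name).
sed "s|<spark-nginx-image-here>|${SPARK_NGINX_IMAGE}|" cleanup-cronjob.yaml | oc apply -f -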
- Based on your use case, use one of the following methods to clean up the Spark logs:
- Manual cleanup using a pod that runs as root with SELinux relabelling disabled: Use the manual method of clearing Spark logs when you have admin privileges and SELinux relabelling is disabled. After the pod is running, you can run the cleanup command from inside it, as shown after the manifest.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
  labels:
    env: test
spec:
  containers:
    - name: nginx
      image: <spark-hb-nginx-image-here>
      imagePullPolicy: IfNotPresent
      securityContext:
        allowPrivilegeEscalation: false
        runAsUser: 0
        seLinuxOptions:
          type: "spc_t"
      volumeMounts:
        - mountPath: /mnt/asset_file_api
          name: file-api-pv
      args:
        - /bin/sh
        - -c
        - sleep infinity
  volumes:
    - name: file-api-pv
      persistentVolumeClaim:
        claimName: file-api-claim
EOF
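The pod above only mounts the file API volume; the cleanup itself still has to be run inside it. The following is an illustrative sketch that runs, once, the same find commands used by the project CronJob below. The pod name test-pod-1 comes from the manifest above, and the two-day retention (-mtime +2) matches the CronJobs in this document; adjust it as needed.
# Remove 'logs' and 'spark-events' directories older than two days (illustrative one-off cleanup).
oc exec test-pod-1 -- /bin/sh -c "find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'logs' -mtime +2 -exec rm -rf {} + && find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'spark-events' -mtime +2 -exec rm -rf {} +"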
- Cron job to clean up Spark jobs/kernels from Watson Studio: Use the following CronJob to clear the logs of Spark jobs and kernels that run from Watson Studio projects.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-project-logs-spark-events-cleanup
spec:
  schedule: "0 0 */2 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: demo-clean
              image: <spark-nginx-image-here>
              args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'logs' -mtime +2 -exec rm -rf {} + && find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'spark-events' -mtime +2 -exec rm -rf {} +"
              volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
              resources:
                limits:
                  cpu: 400m
                  memory: 512Mi
                requests:
                  cpu: 200m
                  memory: 256Mi
          restartPolicy: OnFailure
          volumes:
            - name: file-api-pv
              persistentVolumeClaim:
                claimName: file-api-claim
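To put the CronJob into effect, save the manifest to a file and apply it, then confirm that the resource exists. The file name spark-project-logs-cleanup.yaml below is an illustrative assumption.
# Apply the CronJob manifest and verify that it was created (assumed file name).
oc apply -f spark-project-logs-cleanup.yaml
oc get cronjob spark-project-logs-spark-events-cleanup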
- Cron job to clean up Spark jobs created by service instances with spaces enabled: Use the following CronJob to clear the logs of Spark jobs that run in a service instance with a deployment space.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-spaces-logs-cleanup
spec:
  schedule: "0 0 */2 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: demo-clean
              image: <spark-nginx-image-here>
              args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/spaces/*/assets/runtimes/spark -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
              volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
              resources:
                limits:
                  cpu: 400m
                  memory: 512Mi
                requests:
                  cpu: 200m
                  memory: 256Mi
          restartPolicy: OnFailure
          volumes:
            - name: file-api-pv
              persistentVolumeClaim:
                claimName: file-api-claim
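If you want to verify the cleanup without waiting for the next scheduled run, you can trigger a one-off Job from the CronJob. The job name spark-spaces-cleanup-test is an arbitrary illustrative choice.
# Run the cleanup once, outside the schedule, and inspect its output (illustrative job name).
oc create job --from=cronjob/spark-spaces-logs-cleanup spark-spaces-cleanup-test
oc logs job/spark-spaces-cleanup-test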
- Cron job to clean up Spark jobs created by WKC profiling (MDE) jobs: Use the following CronJob to clear the logs of Spark jobs created by WKC profiling and metadata enrichment jobs. Replace <instance-id> in the manifest with the ID of your instance.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-wkc-logs-cleanup
spec:
  schedule: "0 0 */2 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: demo-clean
              image: <spark-nginx-image-here>
              args:
                - /bin/sh
                - -c
                - "find /mnt/wkc_volume/<instance-id> -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
              volumeMounts:
                - name: wkc-volume
                  mountPath: /mnt/wkc_volume
              resources:
                limits:
                  cpu: 400m
                  memory: 512Mi
                requests:
                  cpu: 200m
                  memory: 256Mi
          restartPolicy: OnFailure
          volumes:
            - name: wkc-volume
              persistentVolumeClaim:
                claimName: volumes-profstgintrnl-pvc
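As an optional check, you can confirm that the CronJob exists and has been scheduled. This is an illustrative sketch, not part of the documented procedure.
# Confirm the CronJob exists and show when it last ran.
oc get cronjob spark-wkc-logs-cleanup
oc get cronjob spark-wkc-logs-cleanup -o jsonpath='{.status.lastScheduleTime}'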
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p000000UoQtAAK","label":"Administration"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]
Was this topic helpful?
Document Information
Modified date:
27 February 2024
UID
ibm16980928