IBM Support

Cleaning Spark history logs

Troubleshooting


Problem

Running Spark applications generates Spark logs that are stored in the underlying storage volume. Accumulated log files can cause the Kubernetes cluster to delay container creation, often with the error message "Context deadline exceeded", which leaves the Spark pod stuck in the CreateContainerError state.

Symptom

The Kubernetes cluster delays the creation of containers and displays the error message "Context deadline exceeded". The Spark pod stalls in the CreateContainerError state.

Resolving The Problem

To resolve the issue, clear the Spark logs either manually or with a Kubernetes CronJob.
Procedure:
  1. Run the following command to get the spark-hb-nginx image:
    oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
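    For example, you can capture the image in a shell variable for substitution into the manifests that follow (the variable name is illustrative):
    SPARK_NGINX_IMAGE=$(oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}')
    echo "$SPARK_NGINX_IMAGE"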
  2. Based on your use case, use one of the following methods to clean up the Spark logs:
  • Manual cleanup using a pod that runs as root with SELinux relabeling disabled: Use the manual method of clearing Spark logs when you have admin privileges and SELinux relabeling is disabled. Create the pod with the following manifest, then remove the old logs from inside it, as shown after the manifest.
    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pod-1
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: <spark-hb-nginx-image-here>
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          runAsUser: 0
          seLinuxOptions:
            type: "spc_t"
        volumeMounts:
          - mountPath: /mnt/asset_file_api
            name: file-api-pv
        args:
          - /bin/sh
          - -c
          - sleep infinity
      volumes:
      - name: file-api-pv
        persistentVolumeClaim:
          claimName: file-api-claim
    EOF
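    Once the pod is Running, open a shell inside it and delete the old log directories. A minimal sketch that mirrors the find expressions used by the cron jobs below (the two -name tests are combined with -o; adjust the -mtime retention window to your needs):
    # Open a shell in the cleanup pod
    oc exec -it test-pod-1 -- /bin/sh
    # Inside the pod: remove 'logs' and 'spark-events' directories older than 2 days
    find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d \( -name 'logs' -o -name 'spark-events' \) -mtime +2 -exec rm -rf {} +
    # When finished, exit the shell and delete the pod
    oc delete pod test-pod-1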
    
  • CronJob to clean up Spark jobs/kernels from Watson Studio: Use the following CronJob to clear the logs of Spark jobs run from Watson Studio; apply it as shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-project-logs-spark-events-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'logs' -mtime +2 -exec rm -rf {} + && find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'spark-events' -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: file-api-pv
                persistentVolumeClaim:
                  claimName: file-api-claim
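    Save the manifest to a file and create it in the namespace that contains the file-api-claim PVC (the file name and namespace placeholder are illustrative):
    oc apply -f spark-project-logs-cleanup.yaml -n <cpd-instance-namespace>
    oc get cronjob spark-project-logs-spark-events-cleanup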
    
  • CronJob to clean up Spark jobs created by service instances with spaces enabled: Use the following CronJob to clear the logs of Spark jobs that run in a service instance with a deployment space; you can test it with a one-off run, as shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-spaces-logs-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/spaces/*/assets/runtimes/spark -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: file-api-pv
                persistentVolumeClaim:
                  claimName: file-api-claim
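    To test the cleanup without waiting for the next scheduled run, trigger a one-off job from the CronJob (the job name is arbitrary):
    oc create job --from=cronjob/spark-spaces-logs-cleanup spark-spaces-logs-cleanup-manual
    oc logs -f job/spark-spaces-logs-cleanup-manual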
  • CronJob to clean up Spark jobs created by WKC profiling (MDE) jobs: Use the following CronJob to clear the logs of Spark jobs created by WKC profiling and Metadata Enrichment (MDE) jobs; a dry run of the find expression is shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-wkc-logs-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/wkc_volume/<instance-id> -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: wkc-volume
                  mountPath: /mnt/wkc_volume
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: wkc-volume
                persistentVolumeClaim:
                  claimName: volumes-profstgintrnl-pvc
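    Before enabling any of these jobs, you can dry-run the match by listing the directories that the find expression would delete, from inside a pod that mounts the corresponding PVC (for the WKC path this is volumes-profstgintrnl-pvc rather than file-api-claim); nothing is removed:
    # List the directories older than 2 days that the cron job would remove
    find /mnt/wkc_volume/<instance-id> -maxdepth 1 -mindepth 1 -type d -mtime +2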

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p000000UoQtAAK","label":"Administration"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
27 February 2024

UID

ibm16980928