IBM Support

Cleaning Spark history logs

Troubleshooting


Problem

Running Spark applications generates Spark logs that are stored in the underlying storage volume. Accumulated log files can cause the Kubernetes cluster to delay container creation, often with the error message "Context deadline exceeded", which leaves the Spark pod stuck in the CreateContainerError state.

Symptom

The Kubernetes cluster delays the creation of containers and displays the error message "Context deadline exceeded". The Spark pod stalls in the CreateContainerError state.

Resolving The Problem

To resolve the issue, clear the Spark logs either manually or with a Kubernetes CronJob.
Procedure:
  1. Run the following command to get the spark-hb-nginx image:
    oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}'
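    For example, you can capture the image in a shell variable for substitution into the manifests that follow (the variable name is illustrative):
    SPARK_NGINX_IMAGE=$(oc get deploy spark-hb-nginx -o=jsonpath='{$.spec.template.spec.containers[:1].image}')
    echo "$SPARK_NGINX_IMAGE"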
  2. Based on your use case, use one of the following methods to clean up the Spark logs:
  • Manual cleanup using a pod that runs as root with SELinux relabeling disabled: Use the manual method of clearing Spark logs when you have admin privileges and SELinux relabeling is disabled. Create the pod with the following manifest, then remove the old logs from inside it, as shown after the manifest.
    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pod-1
      labels:
        env: test
    spec:
      containers:
      - name: nginx
        image: <spark-hb-nginx-image-here>
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          runAsUser: 0
          seLinuxOptions:
            type: "spc_t"
        volumeMounts:
          - mountPath: /mnt/asset_file_api
            name: file-api-pv
        args:
          - /bin/sh
          - -c
          - sleep infinity
      volumes:
      - name: file-api-pv
        persistentVolumeClaim:
          claimName: file-api-claim
    EOF
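    Once the pod is Running, open a shell inside it and delete the old log directories. A minimal sketch that mirrors the find expressions used by the cron jobs below (the two -name tests are combined with -o; adjust the -mtime retention window to your needs):
    # Open a shell in the cleanup pod
    oc exec -it test-pod-1 -- /bin/sh
    # Inside the pod: remove 'logs' and 'spark-events' directories older than 2 days
    find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d \( -name 'logs' -o -name 'spark-events' \) -mtime +2 -exec rm -rf {} +
    # When finished, exit the shell and delete the pod
    oc delete pod test-pod-1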
    
  • CronJob to clean up Spark jobs/kernels from Watson Studio: Use the following CronJob to clear the logs of Spark jobs run from Watson Studio; apply it as shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-project-logs-spark-events-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'logs' -mtime +2 -exec rm -rf {} + && find /mnt/asset_file_api/projects/*/* -maxdepth 1 -mindepth 1 -type d -name 'spark-events' -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: file-api-pv
                persistentVolumeClaim:
                  claimName: file-api-claim
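    Save the manifest to a file and create it in the namespace that contains the file-api-claim PVC (the file name and namespace placeholder are illustrative):
    oc apply -f spark-project-logs-cleanup.yaml -n <cpd-instance-namespace>
    oc get cronjob spark-project-logs-spark-events-cleanup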
    
  • CronJob to clean up Spark jobs created by service instances with spaces enabled: Use the following CronJob to clear the logs of Spark jobs that run in a service instance with a deployment space; you can test it with a one-off run, as shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-spaces-logs-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/asset_file_api/spaces/*/assets/runtimes/spark -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: file-api-pv
                  mountPath: /mnt/asset_file_api
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: file-api-pv
                persistentVolumeClaim:
                  claimName: file-api-claim
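    To test the cleanup without waiting for the next scheduled run, trigger a one-off job from the CronJob (the job name is arbitrary):
    oc create job --from=cronjob/spark-spaces-logs-cleanup spark-spaces-logs-cleanup-manual
    oc logs -f job/spark-spaces-logs-cleanup-manual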
  • CronJob to clean up Spark jobs created by WKC profiling (MDE) jobs: Use the following CronJob to clear the logs of Spark jobs created by WKC profiling and Metadata Enrichment (MDE) jobs; a dry run of the find expression is shown after the manifest.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: spark-wkc-logs-cleanup
    spec:
      schedule: "0 0 */2 * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: demo-clean
                image: <spark-hb-nginx-image-here>
                args:
                - /bin/sh
                - -c
                - "find /mnt/wkc_volume/<instance-id> -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +"
                volumeMounts:
                - name: wkc-volume
                  mountPath: /mnt/wkc_volume
                resources:
                  limits:
                    cpu: 400m
                    memory: 512Mi
                  requests:
                    cpu: 200m
                    memory: 256Mi
              restartPolicy: OnFailure
              volumes:
              - name: wkc-volume
                persistentVolumeClaim:
                  claimName: volumes-profstgintrnl-pvc
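    Before enabling any of these jobs, you can dry-run the match by listing the directories that the find expression would delete, from inside a pod that mounts the corresponding PVC (for the WKC path this is volumes-profstgintrnl-pvc rather than file-api-claim); nothing is removed:
    # List the directories older than 2 days that the cron job would remove
    find /mnt/wkc_volume/<instance-id> -maxdepth 1 -mindepth 1 -type d -mtime +2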

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m3p000000UoQtAAK","label":"Administration"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
27 February 2024

UID

ibm16980928