
Helm chart deployment failing with /app/app-config.gz: not in gzip format #10

vara-bonthu opened this issue Nov 5, 2022 · 6 comments


@vara-bonthu

vara-bonthu commented Nov 5, 2022

I have tried to deploy the Batch Processing Gateway (BPG) using the Helm chart in an EKS cluster, but I hit an error: gzip: /app/app-config.gz: not in gzip format.

1/ Built a new Docker image (x86 arch) from the Dockerfile on the main branch and pushed it to the public ECR repo public.ecr.aws/r1l5w1y9/batch-processing-gateway. Here is the values.yaml

2/ Generated a BPG config YAML file, converted it to a base64-encoded string, and added that string to encodedConfig in the Helm chart values.yaml

3/ The deployment also uses an AWS S3 bucket and PostgreSQL RDS, with IRSA configured for the BPG service account attached to the bpg and bpg-helper deployments

Here is the Terraform code snippet for the deployment

I can see the ConfigMap value for bpg in the output, but the pod still reports that the config is not found.

Please advise if I am missing anything.

Errors

I am seeing the following errors from the BPG and BPG-helper pods:

gzip: /app/app-config.gz: not in gzip format
14:34:33.217 [main] INFO com.apple.spark.BPGApplication - Starting server, version: 1.1, revision: 
io.dropwizard.configuration.ConfigurationParsingException: /etc/app_config/app-config.yaml has an error:
  * Configuration at /etc/app_config/app-config.yaml must not be empty

	at io.dropwizard.configuration.ConfigurationParsingException$Builder.build(ConfigurationParsingException.java:278)
	at io.dropwizard.configuration.BaseConfigurationFactory.build(BaseConfigurationFactory.java:87)
	at io.dropwizard.cli.ConfiguredCommand.parseConfiguration(ConfiguredCommand.java:139)
	at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:85)
	at io.dropwizard.cli.Cli.run(Cli.java:78)
	at io.dropwizard.Application.run(Application.java:94)
	at com.apple.spark.BPGApplication.main(BPGApplication.java:87)


@yuchaoran2011
Collaborator

Thanks @vara-bonthu for reporting this issue. It looks like we missed documenting this step when the project was open sourced: the YAML string must be gzip-compressed before it is base64-encoded. A Python snippet that does this looks like:

import gzip, base64

# raw_yaml holds the BPG app config YAML as a string
compressed_yaml = gzip.compress(bytes(raw_yaml, 'utf-8'))
binary_data = base64.b64encode(compressed_yaml).decode('utf-8')

The resulting binary_data is the value that goes into the Helm chart. cc @tongtianqi777
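
For reference, an equivalent shell sketch (this assumes GNU base64 for the -w0 flag, and that app-config.yaml is your raw BPG config file):

# gzip-compress the config and base64-encode it on a single line
gzip -c app-config.yaml | base64 -w0

The single-line output is what gets pasted into encodedConfig in the Helm chart values.yaml.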

@vara-bonthu
Author

Thanks, @yuchaoran2011! I will give it a try and raise a PR for the docs.

@vara-bonthu
Author

vara-bonthu commented Nov 7, 2022

Thanks! The above issue has now been resolved with the Python snippet, and I am able to deploy the BPG Helm chart successfully.

I have moved on to the next step of executing the sample job, but I have hit a new error. It looks like a permissions issue or a Kubernetes client configuration problem.

Job details

➜  vara git:(bpm) ✗ curl -u admin:admin http://localhost:8080/apiv2/spark -i -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "applicationName": "demo",
        "queue": "dev",
        "sparkVersion": "3.2",
        "mainApplicationFile": "s3a//spark-3279499/uploaded/foo/MinimalSparkApp.py",
        "driver": {
          "cores": 1,
          "memory": "2g"
        },
        "executor": {
            "instances": 1,
            "cores": 1,
            "memory": "2g"
        }
    }'
HTTP/1.1 500 Internal Server Error
Date: Mon, 07 Nov 2022 22:27:20 GMT
Content-Type: application/json
Content-Length: 196

{"code":500,"message":"io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [SparkApplicationResource]  with name: [null]  in namespace: [spark-team-a]  failed."}%  

BPG Pod log error

WARN  [2022-11-07 14:59:27,429] io.fabric8.kubernetes.client.informers.cache.Controller: Reflector list-watching job exiting because the thread-pool is shutting down
! java.io.FileNotFoundException: /root/.kube/config (No such file or directory)
! at java.base/java.io.FileInputStream.open0(Native Method)
! at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
! at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
! at com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:354)
! at com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:15)
! at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3494)
! at io.fabric8.kubernetes.client.internal.KubeConfigUtils.parseConfig(KubeConfigUtils.java:42)
! at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:43)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
! at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
! at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
! at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
! at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
! at okhttp3.RealCall.execute(RealCall.java:93)
! at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:472)
! at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:435)
! at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:418)
! at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:160)
! ... 8 common frames omitted
! Causing: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list]  for kind: [SparkApplicationResource]  with name: [null]  in namespace: [spark-team-a]  failed.
! at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
! at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
! at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:167)
! at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:675)
! at io.fabric8.kubernetes.client.informers.SharedInformerFactory$1.list(SharedInformerFactory.java:169)
! at io.fabric8.kubernetes.client.informers.SharedInformerFactory$1.list(SharedInformerFactory.java:164)
! at io.fabric8.kubernetes.client.informers.cache.Reflector.getList(Reflector.java:67)
! ... 4 common frames omitted
! Causing: java.util.concurrent.RejectedExecutionException: Error while doing ReflectorRunnable list
! at io.fabric8.kubernetes.client.informers.cache.Reflector.getList(Reflector.java:73)
! at io.fabric8.kubernetes.client.informers.cache.Reflector.reListAndSync(Reflector.java:94)
! at io.fabric8.kubernetes.client.informers.cache.Reflector.listAndWatch(Reflector.java:80)
! ... 2 common frames omitted
! Causing: java.util.concurrent.RejectedExecutionException: Error while starting ReflectorRunnable watch
! at io.fabric8.kubernetes.client.informers.cache.Reflector.listAndWatch(Reflector.java:85)
! at io.fabric8.kubernetes.client.informers.cache.Controller.run(Controller.java:112)
! at java.base/java.lang.Thread.run(Thread.java:833)

This issue looks similar to kubeflow/spark-operator#1277.

How do I pass an environment variable to the above submission payload to try the workaround mentioned in that issue?

@tongtianqi777
Collaborator

Hi @vara-bonthu, yes, BPG creates a sparkoperator.k8s.io/v1beta2.SparkApplication resource on the Spark cluster for the Spark Operator to pick up.

It looks like your fabric8 client is trying to use a local kube config file, while it is supposed to use the tokens from the BPG config.

A few pointers that may help, since the fabric8 client resolves its configuration from a hierarchy of sources (see the verification sketch after this list):

  • Confirm that you did not modify entrypoint.sh, which ships with kubernetes.auth.tryKubeConfig=false. You can simply log into the pod to check.
  • Confirm that your BPG pod does not have the KUBERNETES_AUTH_TRYKUBECONFIG env var set.
  • Confirm that your BPG config has a valid masterUrl, caCertDataSOPS, userName and userTokenSOPS.
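
A sketch of how those checks could be run from outside the pod (the namespace and pod name placeholders are assumptions, as is the entrypoint.sh path inside the image; the config path /etc/app_config/app-config.yaml is taken from your error log):

# 1. entrypoint.sh should still contain kubernetes.auth.tryKubeConfig=false
kubectl -n <bpg-namespace> exec <bpg-pod> -- grep tryKubeConfig /app/entrypoint.sh
# 2. KUBERNETES_AUTH_TRYKUBECONFIG should not appear in the pod environment
kubectl -n <bpg-namespace> exec <bpg-pod> -- env | grep -i KUBERNETES_AUTH || echo "not set"
# 3. The mounted config should carry the cluster connection fields
kubectl -n <bpg-namespace> exec <bpg-pod> -- grep -E 'masterUrl|userName' /etc/app_config/app-config.yaml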

@vara-bonthu
Author

Thanks @tongtianqi777

Confirm that you did not modify the entrypoint.sh which comes with kubernetes.auth.tryKubeConfig=false (https://github.com/apple/batch-processing-gateway/blob/main/entrypoint.sh#L18). You can simply log into the pod to see it.

  • I simply cloned the repo and created a Docker image (public.ecr.aws/r1l5w1y9/batch-processing-gateway) without any changes to the code.

Here is the config from the BPG pod:
[screenshot]

Confirm that your BPG pod didn't have KUBERNETES_AUTH_TRYKUBECONFIG env var set.

  • Verified; this variable doesn't exist in the pod config.


Confirm that your BPG config has valid masterUrl, caCertDataSOPS, userName and userTokenSOPS.

Yes, I have all of those fields in my config.yaml. This is how I generated the tokens that I placed in the YAML below:

# API server URL of the current kubectl context (used as masterUrl)
context=$(kubectl config current-context)
serverUrl=$(kubectl config view -o jsonpath='{.clusters[?(@.name == "'${context}'")].cluster.server}')
# Service-account secret for spark-team-a; the token and CA in .data come out base64-encoded
saSecret=$(kubectl -n spark-team-a get sa/spark-team-a -o json | jq -r '.secrets[] | .name')
saToken=$(kubectl -n spark-team-a get secret/${saSecret} -o json | jq -r '.data.token')
saCA=$(kubectl -n spark-team-a get secret/${saSecret} -o json | jq -r '.data."ca.crt"')
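
A possible sanity check on those credentials (a sketch, assuming GNU base64; the values pulled from .data above are base64-encoded, so they are decoded before use):

# List SparkApplications directly against the API server with the service-account token
token=$(echo "${saToken}" | base64 -d)
echo "${saCA}" | base64 -d > /tmp/ca.crt
curl --cacert /tmp/ca.crt -H "Authorization: Bearer ${token}" \
  "${serverUrl}/apis/sparkoperator.k8s.io/v1beta2/namespaces/spark-team-a/sparkapplications"

If this call is rejected, the BPG pod would likely hit the same problem when listing or creating SparkApplication resources.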

Do you see anything wrong in the config?

defaultSparkConf:
  spark.kubernetes.submission.connectionTimeout: 30000
  spark.kubernetes.submission.requestTimeout: 30000
  spark.kubernetes.driver.connectionTimeout: 30000
  spark.kubernetes.driver.requestTimeout: 30000
  spark.sql.debug.maxToStringFields: 75

sparkClusters:
  - weight: 100
    id: cluster-id-1
    eksCluster: arn:aws:eks:us-west-2:345645745645:cluster/spark-k8s-operator
    masterUrl: https://37F929459SDHFHDFHC7275B1.sk1.us-west-2.eks.amazonaws.com
    caCertDataSOPS: REDACTED/FOR/SECURITY/LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWU==
    userTokenSOPS: REDACTED/FOR/SECURITY/ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklrUXplRGhsV1hOQlNuaEZiM1ptTFVFeFREaDNhR3gzWlRoSllXRmtSWEZaVTNZNGRYa3dPRTVzVHpnaWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUScDFwbU96QW5odkNCS3VDaUdoaUE0SlNMMGR5TDlpMTdhTVFrVVZHYW1WTkdBSFYxMFJadzBKSnd5Rjh1YXpXMkMyTFlmNHZobmlyTDl4MkVsV2x0T01yRW9WUTJiM3djemx3N2xGWEFLMk5DYlJQNWxmVklfdlBOaDhTaGdVa0s3U0VzRFR0U1Y1LTBZU1FVZ3RpUzF5VlZqVi1lWjFrX2Q2Mnl6U0g2bkNzUGdSelNQakx1VHd1NEs4Ny05V1dDd3RwV0pCZW05eVcwcVROUDVCSWl3
    userName: spark-team-a  # Data Team name
    sparkApplicationNamespace: spark-team-a # Namespace for running Spark jobs with Cluster role and Cluster role bindings to access Spark Operator 
    sparkServiceAccount: spark-team-a # Service account for running Spark jobs configured with IRSA
    sparkVersions:
      - 3.2
      - 3.1
    queues:  # These are the queues available in YuniKorn
      - dev
      - test
      - default
      - prod
    ttlSeconds: 86400  # 1 day TTL for terminated spark application
    timeoutMillis: 180000
    sparkUIUrl: http://localhost:8080
    batchScheduler: yunikorn
    sparkConf:
      spark.kubernetes.executor.podNamePrefix: '{spark-application-resource-name}'
      spark.eventLog.enabled: "true"
      spark.kubernetes.allocation.batch.size: 2000
      spark.kubernetes.allocation.batch.delay: 1s
      spark.eventLog.dir: s3a://spark-32794/eventlog
      spark.history.fs.logDirectory: s3a://spark-32794/eventlog
      spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
      spark.hadoop.fs.s3a.change.detection.version.required: false
      spark.hadoop.fs.s3a.change.detection.mode: none
      spark.hadoop.fs.s3a.fast.upload: true
      spark.jars.packages: org.apache.hadoop:hadoop-aws:3.2.2
      spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider # Use IRSA
#      spark.hadoop.hive.metastore.uris: thrift://hms.endpoint.com:9083
      spark.sql.warehouse.dir: s3a://spark-3279/warehouse
      spark.sql.catalogImplementation: hive
      spark.jars.ivy: /opt/spark/work-dir/.ivy2
      spark.hadoop.fs.s3a.connection.ssl.enabled: false
    sparkUIOptions:
      ServicePort: 4040
      ingressAnnotations:
        nginx.ingress.kubernetes.io/rewrite-target: /$2
        nginx.ingress.kubernetes.io/proxy-redirect-from: http://\$host/
        nginx.ingress.kubernetes.io/proxy-redirect-to: /spark-applications/{spark-application-resource-name}/
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/configuration-snippet: |-
          proxy_set_header Accept-Encoding "";
          sub_filter_last_modified off;
          sub_filter '<head>' '<head> <base href="/spark-applications/{spark-application-resource-name}/">';
          sub_filter 'href="/' 'href="';
          sub_filter 'src="/' 'src="';
          sub_filter '/{{num}}/jobs/' '/jobs/';
          sub_filter "setUIRoot('')" "setUIRoot('/spark-applications/{spark-application-resource-name}/')";
          sub_filter "document.baseURI.split" "document.documentURI.split";
          sub_filter_once off;
      ingressTLS:  # TODO configure with proper domain name
        - hosts:
            - localhost
          secretName: localhost-tls-secret
    driver:
      env:
        - name: STATSD_SERVER_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: STATSD_SERVER_PORT
          value: "8125"
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: "regional"
    executor:
      env:
        - name: STATSD_SERVER_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: STATSD_SERVER_PORT
          value: "8125"
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: "regional"

sparkImages:
  - name: apache/spark-py:v3.2.2
    types:
      - Python
    version: "3.2"
  - name: apache/spark:v3.2.2
    types:
      - Java
      - Scala
    version: "3.2"

s3Bucket: spark-3279
s3Folder: uploaded
sparkLogS3Bucket: spark-3279
sparkLogIndex: index/index.txt
batchFileLimit: 2016
sparkHistoryDns: localhost
gatewayDns: localhost
sparkHistoryUrl: http://localhost:8088
allowedUsers:
  - '*'
blockedUsers:
  - blocked_user_1
queues:
  - name: dev
    maxRunningMillis: 21600000
queueTokenSOPS: {}
dbStorageSOPS:
  connectionString: jdbc:postgresql://bpg.abcdefgh.us-west-2.rds.amazonaws.com:5432/bpg?useUnicode=yes&characterEncoding=UTF-8&useLegacyDatetimeCode=false&connectTimeout=10000&socketTimeout=30000
  user: bpg
  password: <REDACTED/FOR/SECURITY>
  dbName: bpg
statusCacheExpireMillis: 9000
server:
  applicationConnectors:
    - type: http
      port: 8080
logging:
  level: INFO
  loggers:
    com.apple.spark: INFO
sops: {}

The Spark Operator add-on has been deployed using its Helm chart, with the namespace spark-operator and the service account spark-operator.

YuniKorn has been deployed using its Helm chart in the yunikorn namespace.

I am able to submit jobs to the Spark Operator directly in the spark-team-a namespace, and they work as expected.

@tongtianqi777
Collaborator

Hey @vara-bonthu, if you are using IntelliJ for debugging, you can follow the steps below to run BPG locally with breakpoints:

  • configure JDK 17 for the project
  • go to the file BPGApplication.java
  • right-click in the editor and add a Run Configuration with the following info:
    • SDK: java 17
    • class with main method: com.apple.spark.BPGApplication
    • CLI arguments: server <relative path to your local BPG config in work dir>
  • run it with that configuration

To speed up the start time, you can optionally set the connectionString in the BPG config to null so that it won't connect to the DB before serving the endpoints. Of course, it won't write anything to the DB either.
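
If you prefer the command line over IntelliJ, a rough equivalent sketch (this assumes a Maven build and guesses the shaded jar name; check your actual output under target/):

# Build with JDK 17, then start the Dropwizard server command with a local config
mvn -DskipTests package
java -jar target/batch-processing-gateway-*.jar server ./app-config.yaml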
