Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove locks when finishing a job #1637

Merged
merged 2 commits into from
Aug 7, 2023
Merged

remove locks when finishing a job #1637

merged 2 commits into from
Aug 7, 2023

Conversation

severo
Copy link
Collaborator

@severo severo commented Aug 7, 2023

should fix the issue with old remaining locks, when a job is killed (too long job, after 40 minutes) while it's uploading files to the Hub (lock created with git_branch()).

also: add environment variables in docker compose and helm, add the description in readme, and fix test value

also: add the description in readme, and fix test value
@severo severo marked this pull request as draft August 7, 2023 18:27
@github-actions
Copy link

github-actions bot commented Aug 7, 2023

ArgoCD Diff for commit e6d687d

Updated at 8/7/2023, 9:01:03 PM CEST

App: datasets-server-prod
YAML generation: Success 🟢
App sync status: Out of Sync ⚠️

===== apps/Deployment datasets-server/prod-datasets-server-admin ======
--- /tmp/argocd-diff2347011339/prod-datasets-server-admin-live.yaml	2023-08-07 19:00:59.907269106 +0000
+++ /tmp/argocd-diff2347011339/prod-datasets-server-admin	2023-08-07 19:00:59.903269065 +0000
@@ -419,7 +419,7 @@
           value: "9"
         - name: ADMIN_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-admin:sha-e3f1371
+        image: huggingface/datasets-server-services-admin:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/prod-datasets-server-api ======
--- /tmp/argocd-diff2483715450/prod-datasets-server-api-live.yaml	2023-08-07 19:00:59.923269270 +0000
+++ /tmp/argocd-diff2483715450/prod-datasets-server-api	2023-08-07 19:00:59.923269270 +0000
@@ -374,7 +374,7 @@
           value: "9"
         - name: API_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-api:sha-e3f1371
+        image: huggingface/datasets-server-services-api:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/prod-datasets-server-rows ======
--- /tmp/argocd-diff1778307238/prod-datasets-server-rows-live.yaml	2023-08-07 19:00:59.963269679 +0000
+++ /tmp/argocd-diff1778307238/prod-datasets-server-rows	2023-08-07 19:00:59.955269597 +0000
@@ -447,7 +447,7 @@
           value: "9"
         - name: API_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-rows:sha-e3f1371
+        image: huggingface/datasets-server-services-rows:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/prod-datasets-server-search ======
--- /tmp/argocd-diff3046706752/prod-datasets-server-search-live.yaml	2023-08-07 19:00:59.983269883 +0000
+++ /tmp/argocd-diff3046706752/prod-datasets-server-search	2023-08-07 19:00:59.983269883 +0000
@@ -452,7 +452,7 @@
           value: refs/convert/parquet
         - name: DUCKDB_INDEX_CACHE_DIRECTORY
           value: /storage/duckdb-index
-        image: huggingface/datasets-server-services-search:sha-e3f1371
+        image: huggingface/datasets-server-services-search:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/prod-datasets-server-storage-admin ======
--- /tmp/argocd-diff2384099058/prod-datasets-server-storage-admin-live.yaml	2023-08-07 19:01:00.007270129 +0000
+++ /tmp/argocd-diff2384099058/prod-datasets-server-storage-admin	2023-08-07 19:01:00.003270088 +0000
@@ -400,7 +400,7 @@
         helm.sh/chart: datasets-server
     spec:
       containers:
-      - image: huggingface/datasets-server-services-storage-admin:sha-e3f1371
+      - image: huggingface/datasets-server-services-storage-admin:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: prod-datasets-server-storage-admin
         resources:

===== apps/Deployment datasets-server/prod-datasets-server-worker-all ======
--- /tmp/argocd-diff3187170420/prod-datasets-server-worker-all-live.yaml	2023-08-07 19:01:00.047270538 +0000
+++ /tmp/argocd-diff3187170420/prod-datasets-server-worker-all	2023-08-07 19:01:00.039270456 +0000
@@ -634,6 +634,8 @@
           value: "60"
         - name: WORKER_KILL_ZOMBIES_INTERVAL_SECONDS
           value: "600"
+        - name: WORKER_KILL_LONG_JOB_INTERVAL_SECONDS
+          value: "60"
         - name: WORKER_MAX_DISK_USAGE_PCT
           value: "90"
         - name: WORKER_MAX_JOB_DURATION_SECONDS
@@ -737,7 +739,7 @@
           value: "0"
         - name: WORKER_JOB_TYPES_BLOCKED
         - name: WORKER_JOB_TYPES_ONLY
-        image: huggingface/datasets-server-services-worker:sha-e3f1371
+        image: huggingface/datasets-server-services-worker:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: prod-datasets-server-worker
         resources:

===== apps/Deployment datasets-server/prod-datasets-server-worker-light ======
--- /tmp/argocd-diff1895955455/prod-datasets-server-worker-light-live.yaml	2023-08-07 19:01:00.087270947 +0000
+++ /tmp/argocd-diff1895955455/prod-datasets-server-worker-light	2023-08-07 19:01:00.079270865 +0000
@@ -634,6 +634,8 @@
           value: "60"
         - name: WORKER_KILL_ZOMBIES_INTERVAL_SECONDS
           value: "600"
+        - name: WORKER_KILL_LONG_JOB_INTERVAL_SECONDS
+          value: "60"
         - name: WORKER_MAX_DISK_USAGE_PCT
           value: "90"
         - name: WORKER_MAX_JOB_DURATION_SECONDS
@@ -737,7 +739,7 @@
           value: "0"
         - name: WORKER_JOB_TYPES_BLOCKED
         - name: WORKER_JOB_TYPES_ONLY
-        image: huggingface/datasets-server-services-worker:sha-e3f1371
+        image: huggingface/datasets-server-services-worker:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: prod-datasets-server-worker
         resources:

===== batch/CronJob datasets-server/prod-datasets-server-job-backfill ======
--- /tmp/argocd-diff1108197624/prod-datasets-server-job-backfill-live.yaml	2023-08-07 19:01:00.099271070 +0000
+++ /tmp/argocd-diff1108197624/prod-datasets-server-job-backfill	2023-08-07 19:01:00.095271029 +0000
@@ -206,7 +206,7 @@
               value: CreateCommitError,LockedDatasetTimeoutError
             - name: LOG_LEVEL
               value: debug
-            image: huggingface/datasets-server-jobs-cache_maintenance:sha-e3f1371
+            image: huggingface/datasets-server-jobs-cache_maintenance:sha-f07c091
             imagePullPolicy: IfNotPresent
             name: prod-datasets-server-backfill
             resources:

===== batch/CronJob datasets-server/prod-datasets-server-job-delete-indexes ======
--- /tmp/argocd-diff1134456254/prod-datasets-server-job-delete-indexes-live.yaml	2023-08-07 19:01:00.111271193 +0000
+++ /tmp/argocd-diff1134456254/prod-datasets-server-job-delete-indexes	2023-08-07 19:01:00.107271152 +0000
@@ -251,7 +251,7 @@
               value: /storage/duckdb-index
             - name: DUCKDB_INDEX_EXPIRED_TIME_INTERVAL_SECONDS
               value: "259200"
-            image: huggingface/datasets-server-jobs-cache_maintenance:sha-e3f1371
+            image: huggingface/datasets-server-jobs-cache_maintenance:sha-f07c091
             imagePullPolicy: IfNotPresent
             name: prod-datasets-server-delete-indexes
             resources:

===== batch/CronJob datasets-server/prod-datasets-server-job-metrics-collector ======
--- /tmp/argocd-diff1976569464/prod-datasets-server-job-metrics-collector-live.yaml	2023-08-07 19:01:00.119271275 +0000
+++ /tmp/argocd-diff1976569464/prod-datasets-server-job-metrics-collector	2023-08-07 19:01:00.119271275 +0000
@@ -199,7 +199,7 @@
                   optional: false
             - name: CACHE_MAINTENANCE_ACTION
               value: collect-metrics
-            image: huggingface/datasets-server-jobs-cache_maintenance:sha-e3f1371
+            image: huggingface/datasets-server-jobs-cache_maintenance:sha-f07c091
             imagePullPolicy: IfNotPresent
             name: prod-datasets-server-metrics-collector
             resources:

App: datasets-server-staging
YAML generation: Success 🟢
App sync status: Out of Sync ⚠️

===== apps/Deployment datasets-server/staging-datasets-server-admin ======
--- /tmp/argocd-diff699527647/staging-datasets-server-admin-live.yaml	2023-08-07 19:01:02.895299878 +0000
+++ /tmp/argocd-diff699527647/staging-datasets-server-admin	2023-08-07 19:01:02.891299832 +0000
@@ -387,7 +387,7 @@
           value: "1"
         - name: ADMIN_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-admin:sha-e3f1371
+        image: huggingface/datasets-server-services-admin:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/staging-datasets-server-api ======
--- /tmp/argocd-diff4110862868/staging-datasets-server-api-live.yaml	2023-08-07 19:01:02.907300018 +0000
+++ /tmp/argocd-diff4110862868/staging-datasets-server-api	2023-08-07 19:01:02.907300018 +0000
@@ -325,7 +325,7 @@
           value: "1"
         - name: API_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-api:sha-e3f1371
+        image: huggingface/datasets-server-services-api:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/staging-datasets-server-rows ======
--- /tmp/argocd-diff1849322737/staging-datasets-server-rows-live.yaml	2023-08-07 19:01:02.947300485 +0000
+++ /tmp/argocd-diff1849322737/staging-datasets-server-rows	2023-08-07 19:01:02.943300438 +0000
@@ -438,7 +438,7 @@
           value: "1"
         - name: API_UVICORN_PORT
           value: "8080"
-        image: huggingface/datasets-server-services-rows:sha-e3f1371
+        image: huggingface/datasets-server-services-rows:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/staging-datasets-server-search ======
--- /tmp/argocd-diff3129912014/staging-datasets-server-search-live.yaml	2023-08-07 19:01:02.967300718 +0000
+++ /tmp/argocd-diff3129912014/staging-datasets-server-search	2023-08-07 19:01:02.963300671 +0000
@@ -443,7 +443,7 @@
           value: refs/convert/parquet
         - name: DUCKDB_INDEX_CACHE_DIRECTORY
           value: /storage/duckdb-index
-        image: huggingface/datasets-server-services-search:sha-e3f1371
+        image: huggingface/datasets-server-services-search:sha-f07c091
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 30

===== apps/Deployment datasets-server/staging-datasets-server-storage-admin ======
--- /tmp/argocd-diff1034948407/staging-datasets-server-storage-admin-live.yaml	2023-08-07 19:01:02.995301045 +0000
+++ /tmp/argocd-diff1034948407/staging-datasets-server-storage-admin	2023-08-07 19:01:02.991300998 +0000
@@ -381,7 +381,7 @@
         helm.sh/chart: datasets-server
     spec:
       containers:
-      - image: huggingface/datasets-server-services-storage-admin:sha-e3f1371
+      - image: huggingface/datasets-server-services-storage-admin:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: staging-datasets-server-storage-admin
         resources:

===== apps/Deployment datasets-server/staging-datasets-server-worker-all ======
--- /tmp/argocd-diff3060702760/staging-datasets-server-worker-all-live.yaml	2023-08-07 19:01:03.035301511 +0000
+++ /tmp/argocd-diff3060702760/staging-datasets-server-worker-all	2023-08-07 19:01:03.023301371 +0000
@@ -631,6 +631,8 @@
           value: "60"
         - name: WORKER_KILL_ZOMBIES_INTERVAL_SECONDS
           value: "600"
+        - name: WORKER_KILL_LONG_JOB_INTERVAL_SECONDS
+          value: "60"
         - name: WORKER_MAX_DISK_USAGE_PCT
           value: "90"
         - name: WORKER_MAX_JOB_DURATION_SECONDS
@@ -732,7 +734,7 @@
           value: "0"
         - name: WORKER_JOB_TYPES_BLOCKED
         - name: WORKER_JOB_TYPES_ONLY
-        image: huggingface/datasets-server-services-worker:sha-e3f1371
+        image: huggingface/datasets-server-services-worker:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: staging-datasets-server-worker
         resources:

===== apps/Deployment datasets-server/staging-datasets-server-worker-light ======
--- /tmp/argocd-diff364615218/staging-datasets-server-worker-light-live.yaml	2023-08-07 19:01:03.067301884 +0000
+++ /tmp/argocd-diff364615218/staging-datasets-server-worker-light	2023-08-07 19:01:03.063301838 +0000
@@ -631,6 +631,8 @@
           value: "60"
         - name: WORKER_KILL_ZOMBIES_INTERVAL_SECONDS
           value: "600"
+        - name: WORKER_KILL_LONG_JOB_INTERVAL_SECONDS
+          value: "60"
         - name: WORKER_MAX_DISK_USAGE_PCT
           value: "90"
         - name: WORKER_MAX_JOB_DURATION_SECONDS
@@ -732,7 +734,7 @@
           value: "0"
         - name: WORKER_JOB_TYPES_BLOCKED
         - name: WORKER_JOB_TYPES_ONLY
-        image: huggingface/datasets-server-services-worker:sha-e3f1371
+        image: huggingface/datasets-server-services-worker:sha-f07c091
         imagePullPolicy: IfNotPresent
         name: staging-datasets-server-worker
         resources:

===== batch/CronJob datasets-server/staging-datasets-server-job-delete-indexes ======
--- /tmp/argocd-diff2397154195/staging-datasets-server-job-delete-indexes-live.yaml	2023-08-07 19:01:03.087302117 +0000
+++ /tmp/argocd-diff2397154195/staging-datasets-server-job-delete-indexes	2023-08-07 19:01:03.083302071 +0000
@@ -251,7 +251,7 @@
               value: /storage/duckdb-index
             - name: DUCKDB_INDEX_EXPIRED_TIME_INTERVAL_SECONDS
               value: "600"
-            image: huggingface/datasets-server-jobs-cache_maintenance:sha-e3f1371
+            image: huggingface/datasets-server-jobs-cache_maintenance:sha-f07c091
             imagePullPolicy: IfNotPresent
             name: staging-datasets-server-delete-indexes
             resources:

===== batch/CronJob datasets-server/staging-datasets-server-job-metrics-collector ======
--- /tmp/argocd-diff1423630717/staging-datasets-server-job-metrics-collector-live.yaml	2023-08-07 19:01:03.095302211 +0000
+++ /tmp/argocd-diff1423630717/staging-datasets-server-job-metrics-collector	2023-08-07 19:01:03.095302211 +0000
@@ -199,7 +199,7 @@
                   optional: false
             - name: CACHE_MAINTENANCE_ACTION
               value: collect-metrics
-            image: huggingface/datasets-server-jobs-cache_maintenance:sha-e3f1371
+            image: huggingface/datasets-server-jobs-cache_maintenance:sha-f07c091
             imagePullPolicy: IfNotPresent
             name: staging-datasets-server-metrics-collector
             resources:

Legend Status
The app is synced in ArgoCD, and diffs you see are solely from this PR.
⚠️ The app is out-of-sync in ArgoCD, and the diffs you see include those changes plus any from this PR.
🛑 There was an error generating the ArgoCD diffs due to changes in this PR.

@codecov-commenter
Copy link

Codecov Report

Patch coverage: 100.00% and project coverage change: -7.14% ⚠️

Comparison is base (310a27d) 99.37% compared to head (d462b64) 92.24%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1637      +/-   ##
==========================================
- Coverage   99.37%   92.24%   -7.14%     
==========================================
  Files           6       77      +71     
  Lines         161     5464    +5303     
==========================================
+ Hits          160     5040    +4880     
- Misses          1      424     +423     
Flag Coverage Δ
jobs_cache_maintenance ?
services_worker 92.24% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
services/worker/tests/conftest.py 98.94% <100.00%> (ø)

... and 82 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

should fix the issue with old remaining locks, when a job is killed (too
long job, after 40 minutes) while it's uploading files to the Hub (lock
created with git_branch()).
@severo severo changed the title feat: 🎸 add environment variables in dockercompose and helm remove locks when finishing a job Aug 7, 2023
@severo
Copy link
Collaborator Author

severo commented Aug 7, 2023

research notes:

to get all the old remaining locks:

db.locks.find({updated_at: {"$lt": new Date(ISODate().getTime() - 1000*60*40)}})

Today: returns 64 entries, eg:

{ _id: '{"dataset": "KakologArchives/KakologArchives", "branch": "refs/convert/parquet"}',
  owner: '64cadc5c4e007fc04e377edd',
  updated_at: 2023-08-02T23:25:16.131Z }
{ _id: '{"dataset": "JonathanSum/sv_corpora_parliament_processed", "branch": "refs/convert/parquet"}',
  owner: '64c979ce8acff8dd3afd8052',
  updated_at: 2023-08-03T06:09:50.569Z }
{ _id: '{"dataset": "Plim/fr_corpora_parliament_processed", "branch": "refs/convert/parquet"}',
  owner: '64c979e68acff8dd3afd8515',
  updated_at: 2023-08-03T06:17:51.870Z }
{ _id: '{"dataset": "RuudVelo/nl_corpora_parliament_processed", "branch": "refs/convert/parquet"}',
  owner: '64c979e98acff8dd3afd8557',
  updated_at: 2023-08-03T06:25:30.462Z }
{ _id: '{"dataset": "duongttr/vi-job-description-dataset", "branch": "refs/convert/parquet"}',
  owner: '64cb48135084427ff1044baa',
  updated_at: 2023-08-03T07:01:06.002Z }

They have a non-null owner, and have been created with lock.git_branch() (ie: during parquet or duckdb index upload).

Possibly: when a pod was killed (sync? zombie killer?)

  • sync: no. We have a lot of locks with owner on 20230803, at every hour, and there was only one sync in the whole day
  • stats in the last 7 days:

@severo severo marked this pull request as ready for review August 7, 2023 19:08
@severo
Copy link
Collaborator Author

severo commented Aug 7, 2023

I'll manually remove the owner from the existing old lock entries after deploying with:

db.locks.updateMany(
  {
    updated_at: {"$lt": new Date(ISODate().getTime() - 1000*60*40)}
  },
  {
    "$currentDate": {"updated_at": true},
    "$set": {"owner": null}
  }
)

@severo severo merged commit 664692d into main Aug 7, 2023
20 checks passed
@severo severo deleted the fix-locks branch August 7, 2023 19:17
@severo
Copy link
Collaborator Author

severo commented Aug 7, 2023

< { acknowledged: true,
  insertedId: null,
  matchedCount: 64,
  modifiedCount: 64,
  upsertedCount: 0 }
> db.locks.find({updated_at: {"$lt": new Date(ISODate().getTime() - 1000*60*40)}})
< 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants