Reorganise into a single 'import process' spike #1023

Cruikshanks · 2024-09-10T14:50:00Z

https://eaflood.atlassian.net/browse/WATER-4670

We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with one ready for ReSP.

We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order.

But the latest version that shipped is failing to complete. We believe the issues are:

Each 'job' is taking longer to complete than expected
Based on current timings, the 'jobs' are overlapping
Our attempts to connect the new ReSP import with the existing one are adding to the burden
Based on investigating the logs, we think downstream jobs are no longer being triggered

Our users and the wider team consider the 'import' a single process. Though we know differently in the dev team, we also discuss it in terms of 'the import'. If we were to make it so, we believe we'd eliminate the timing and resource issues, have a better chance of investigating where it has gone wrong, and find it easier to maintain until we can finally eliminate it altogether.

This change is about spiking that idea into existence. From the spike, we'll then break down the changes into focused ones.

https://eaflood.atlassian.net/browse/WATER-4670 We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535). We've also attempted to make clearer what is actually happening during the various 'jobs', and in our environments the order they happen to be more logical. But the latest version that shipped is failing to complete. We believe the issues are: - Each 'job' is taking longer to complete than expected - Based on current timings the 'jobs' are overlapping - Our attempts to connect the new ReSP import with the existing one are adding to the burden - Based on investigating the logs, we think downstream jobs are no longer being triggered Our users and the wider team think of the 'import' as a single process. Though we know differently in the dev team, we also discuss it in terms of 'the import'. If we were to make it so, we believe we'd do away with the timing and resource issues, have a better chance of investigating where it has gone wrong, and find it easier to maintain till we can finally do away with it altogether. This change is about spiking that idea into existence. From the spike we'll then break the changes down into focused ones.

Will avoid every module having there own db service that just does this.

Going forward it will be useful to better track the time each step and process takes.

First step in trying to make things clearer. This copies what the nald-import is doing when downloading and extracting the NALD zip file from the S3 bucket. But it only does those elements, and where we can we've ditched unnecessary abstraction.

This copies what we are doing in 3 other jobs to either mark as deleted or actually delete records from the DB. Bringing them into a process called clean makes it clearer what the steps are doing.

The Repository class looks to have been intended to have a generic way to generate queries to the DB using config. Vastly complex though, and completely unnecessary, especially as we know the 2 insert queries needed. Removed in an effort to simplify the solution and eek out performance improvements.

When we looked at the timings for the existing process, we saw the NALD import took approximately 3.5 hours to process all licences. Our initial working version was going to take 5 hours (average 4 minutes per 1000) based on the timings we now show (though this was based on my local machine). Removing the repository and batching the work in sets of 10, I got the time down to under 2 hours (average 1.5 minutes per 1000).

Import of the companies is the first 'job' in the licence import. In `production` it takes 1 hour 40 mins. Seems to take a similar time locally. This version took 9 1/2 minutes!

This will be the 'one ring to rule them all' process that we'll be switching to going forward. I still need to migrate the remaining jobs of the licence import to be process steps. But having this in place will allow me to kick off what I have and double check all is well so far.

Will call it something else in the short term

Now we have the nald-data and permit processes we no longer need this module. However, the `charging-import` has a direct dependence on everything related to the `transform-permit` module that was in nald-import. To avoid breaking it we move it and all related modules to charging-import.

We shouldn't be importing this data on a nightly basis. We have functionality that depends on this reference so we should be in control of when and how it gets populated. But just in case we move all the old reference data import jobs to a new 'reference' process which will only ever be manually triggered from here on.

These have been migrated to the new processes.

We'll need to think of an alternate approach. But till then we can simply remove them.

CRM and WATER came out of licence-import, and both depend on functions in DateHelpers.

However, need to switch to Promise.allSettled() and log errors.

Using `Promise.allSettled` we can deal with individual licences that fail to import and not break the whole import. Having left the import to run (33 minutes) we dug into the errors we were receiving. One was failing to convert a date. Another was a licence with no non-draft versions. The rest were from not filtering 'draft' status licence versions from the queries for purposes and conditions.

By getting the data we need for the licences when fetching the licences to import in `import.js`, we save approximately 73K calls to the DB. In our tests over 5000 licences this saved 5 secs. Not massive, but a noticeable improvement so worth doing.

This is used to report on the status of the pg-boss 'jobs'. As removed all of them this endpoint and the summary it produces is redundant.

We don't use the information in the table anymore.

Not even sure why they were here rather than in the old `nald-import` in the first place!

These dependencies are now redundant as we no longer rely on their functionality.

This was the last pg-boss job we needed to migrate.

We now update the nightly import process to daisy chain the processes, avoiding running later process if earlier ones have failed. We also collate the results into an object from which we generate a message we can include in the tracker email, as we no longer have pg-boss job stats.

It was just confirming the token used matched the one read in from the .env file. If you have access to the server you have access to the file so auth was a joke. Removing the pointless auth will make triggering job easier from Jenkins.

It's the one that takes the longest and is not as critical as most of the others.

We generate the email content based on the order of the items in the object.

https://eaflood.atlassian.net/browse/WATER-4670 We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535). We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order. But the latest version that shipped is failing to complete. We believe the issues are: - Each 'job' is taking longer to complete than expected - Based on current timings, the 'jobs' are overlapping - Our attempts to connect the new ReSP import with the existing one are adding to the burden - Based on investigating the logs, we think downstream jobs are no longer being triggered We're [working on simplifying the project](#1023) and removing those elements we think decrease performance, for example, the use of [pg-boss](https://github.com/timgit/pg-boss). But in the meantime we need to get the import back up and running. This change removes the extra jobs we put in for connecting to the [water-abstraction-system](https://github.com/DEFRA/water-abstraction-system) till we can come up with a better alternative. We also update the default schedule to allow each part of the import more time to complete to avoid overlapping.

https://eaflood.atlassian.net/browse/WATER-4670 We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535). We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order. But the latest version that shipped is failing to complete. We believe the issues are: - Each 'job' is taking longer to complete than expected - Based on current timings, the 'jobs' are overlapping - Our attempts to connect the new ReSP import with the existing one are adding to the burden - Based on investigating the logs, we think downstream jobs are no longer being triggered We're [working on simplifying the project](#1023) and removing elements that we think decrease performance, such as the use of [pg-boss](https://github.com/timgit/pg-boss). But in the meantime, we need to get the import back up and running. This change removes the extra jobs we put in for connecting to the [water-abstraction-system](https://github.com/DEFRA/water-abstraction-system) until we can develop a better alternative. We also update the default schedule to allow each part of the import more time to complete to avoid overlapping.

https://eaflood.atlassian.net/browse/WATER-4670 After making some changes to [water-abstraction-import](https://github.com/DEFRA/water-abstraction-import) we found it was no longer completing when run. The issue seemed to be - a step we introduced to let [water-abstraction-system](https://github.com/DEFRA/water-abstraction) see importing licences was overloading the environment because it was acting like a DOS on the system app - each job was taking much longer than realised to complete, and the existing job schedule was overlapping them, again leading to limited system resources We've always suspected that the features the previous team added to improve performance (pg-boss message queues and caching to Redis) were actually impacting it. Initially, we planned to go through all the 'jobs' involved in importing data from NALD and strip out this overhead. We [carried out a spike](DEFRA/water-abstraction-import#1023), and our timings are much, _much_ better, proving our suspicions. However, this was a live issue, so we removed the step and tweaked the times to get the import working again. In the meantime, we'll either complete the spike or [WATER-4535, a replacement for the licence import](https://eaflood.atlassian.net/browse/WATER-4535). Either way, the block of job stats we display on the `/health/info` page will be redundant. In fact, we don't use or look at it anyway, even when there is an issue. So, this removes all logic related to job information from the `/health/info` page in readiness.

Cruikshanks added the do not merge Used for spikes and experiments label Sep 10, 2024

Cruikshanks self-assigned this Sep 10, 2024

Cruikshanks added 27 commits September 10, 2024 23:06

Add a new way to log an error msg without airbrake

d6f931f

Add query method to the DB object

f4e9e56

Will avoid every module having there own db service that just does this.

Add our calculate time loggers

37515de

Going forward it will be useful to better track the time each step and process takes.

Stop registering pg-boss based plugins

8f6f00d

Add new clean process

94f35a0

This copies what we are doing in 3 other jobs to either mark as deleted or actually delete records from the DB. Bringing them into a process called clean makes it clearer what the steps are doing.

Add routes for our new processes to main routes

f46e635

Working version permit and document header import

bd0e79d

Working version of the companies import

b28fb59

Import of the companies is the first 'job' in the licence import. In `production` it takes 1 hour 40 mins. Seems to take a similar time locally. This version took 9 1/2 minutes!

Correct controller import naming

20eb296

Remove water import

a2fde21

Will call it something else in the short term

Convert return versions from a job to a process

cbe8b92

Rename companies-import to crm

79db199

Rename permit-import to just permit

dd21737

Convert mod-logs from job to a process

099e8de

Housekeeping - flag param as unused

7c37613

Migrate charging-import to a process

e021088

Remove delete document and company import jobs

639459e

These have been migrated to the new processes.

Remove the queue and import system job

9837193

We'll need to think of an alternate approach. But till then we can simply remove them.

Migrate bill runs import to be a process

0ac9e61

Move DateHelpers to lib

5887735

CRM and WATER came out of licence-import, and both depend on functions in DateHelpers.

Working version of the licence import job

425f9a1

However, need to switch to Promise.allSettled() and log errors.

Cruikshanks added 23 commits September 16, 2024 14:35

Reduce DB calls in water import

5cc4390

By getting the data we need for the licences when fetching the licences to import in `import.js`, we save approximately 73K calls to the DB. In our tests over 5000 licences this saved 5 secs. Not massive, but a noticeable improvement so worth doing.

Add points import step to water process

1182461

Remove old licence-import jobs

e0cfcb8

Rename crm process to company-details

c27f4e6

Update company details to avoid 70K DB requests

1a56efa

Update company details with allSettled and counts

7b66faf

Update permit process with allSettled and counts

bb7a434

Rename water process licence-details

8c48fa2

Remove jobs endpoint

cc3e048

This is used to report on the status of the pg-boss 'jobs'. As removed all of them this endpoint and the summary it produces is redundant.

Delete Application State modules

5e327ea

We don't use the information in the table anymore.

Delete redundant document header delete modules

7cfbc3e

Not even sure why they were here rather than in the old `nald-import` in the first place!

Fix missed references to deleted modules

ff8c957

Remove pgboss, node-cron & catbox-redis

ab495ad

These dependencies are now redundant as we no longer rely on their functionality.

Migrate tracker job to process

f5a211a

This was the last pg-boss job we needed to migrate.

Fix errors in nightly process & tracker

bcccec4

ALways send an email and better handle process fail

ee65f26

Remove JWT authentication

1a36924

It was just confirming the token used matched the one read in from the .env file. If you have access to the server you have access to the file so auth was a joke. Removing the pointless auth will make triggering job easier from Jenkins.

Remove redundant env vars from example

d2f9e0a

Make batch sizes configurable

5917c4b

Move permit process to last place in nightly

78dda8f

It's the one that takes the longest and is not as critical as most of the others.

Ensure results align with process order

1dd24e4

We generate the email content based on the order of the items in the object.

Cruikshanks mentioned this pull request Sep 17, 2024

Remove system jobs from licence-import #1026

Merged

Cruikshanks mentioned this pull request Sep 26, 2024

Remove import jobs stats from /health/info DEFRA/water-abstraction-system#1336

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reorganise into a single 'import process' spike #1023

Reorganise into a single 'import process' spike #1023

Cruikshanks commented Sep 10, 2024

Reorganise into a single 'import process' spike #1023

Are you sure you want to change the base?

Reorganise into a single 'import process' spike #1023

Conversation

Cruikshanks commented Sep 10, 2024