Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganise into a single 'import process' spike #1023

Draft
wants to merge 51 commits into
base: main
Choose a base branch
from

Conversation

Cruikshanks
Copy link
Member

https://eaflood.atlassian.net/browse/WATER-4670

We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with one ready for ReSP.

We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order.

But the latest version that shipped is failing to complete. We believe the issues are:

  • Each 'job' is taking longer to complete than expected
  • Based on current timings, the 'jobs' are overlapping
  • Our attempts to connect the new ReSP import with the existing one are adding to the burden
  • Based on investigating the logs, we think downstream jobs are no longer being triggered

Our users and the wider team consider the 'import' a single process. Though we know differently in the dev team, we also discuss it in terms of 'the import'. If we were to make it so, we believe we'd eliminate the timing and resource issues, have a better chance of investigating where it has gone wrong, and find it easier to maintain until we can finally eliminate it altogether.

This change is about spiking that idea into existence. From the spike, we'll then break down the changes into focused ones.

https://eaflood.atlassian.net/browse/WATER-4670

We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535).

We've also attempted to make clearer what is actually happening during the various 'jobs', and in our environments the order they happen to be more logical.

But the latest version that shipped is failing to complete. We believe the issues are:

- Each 'job' is taking longer to complete than expected
- Based on current timings the 'jobs' are overlapping
- Our attempts to connect the new ReSP import with the existing one are adding to the burden
- Based on investigating the logs, we think downstream jobs are no longer being triggered

Our users and the wider team think of the 'import' as a single process. Though we know differently in the dev team, we also discuss it in terms of 'the import'. If we were to make it so, we believe we'd do away with the timing and resource issues, have a better chance of investigating where it has gone wrong, and find it easier to maintain till we can finally do away with it altogether.

This change is about spiking that idea into existence. From the spike we'll then break the changes down into focused ones.
@Cruikshanks Cruikshanks added the do not merge Used for spikes and experiments label Sep 10, 2024
@Cruikshanks Cruikshanks self-assigned this Sep 10, 2024
Will avoid every module having there own db service that just does this.
Going forward it will be useful to better track the time each step and process takes.
First step in trying to make things clearer. This copies what the nald-import is doing when downloading and extracting the NALD zip file from the S3 bucket.

But it only does those elements, and where we can we've ditched unnecessary abstraction.
This copies what we are doing in 3 other jobs to either mark as deleted or actually delete records from the DB. Bringing them into a process called clean makes it clearer what the steps are doing.
The Repository class looks to have been intended to have a generic way to generate queries to the DB using config.

Vastly complex though, and completely unnecessary, especially as we know the 2 insert queries needed.

Removed in an effort to simplify the solution and eek out performance improvements.
When we looked at the timings for the existing process, we saw the NALD import took approximately 3.5 hours to process all licences.

Our initial working version was going to take 5 hours (average 4 minutes per 1000) based on the timings we now show (though this was based on my local machine).

Removing the repository and batching the work in sets of 10, I got the time down to under 2 hours (average 1.5 minutes per 1000).
Import of the companies is the first 'job' in the licence import. In `production` it takes 1 hour 40 mins. Seems to take a similar time locally. This version took 9 1/2 minutes!
This will be the 'one ring to rule them all' process that we'll be switching to going forward. I still need to migrate the remaining jobs of the licence import to be process steps. But having this in place will allow me to kick off what I have and double check all is well so far.
Will call it something else in the short term
Now we have the nald-data and permit processes we no longer need this module. However, the `charging-import` has a direct dependence on everything related to the `transform-permit` module that was in nald-import. To avoid breaking it we move it and all related modules to charging-import.
We shouldn't be importing this data on a nightly basis. We have functionality that depends on this reference so we should be in control of when and how it gets populated.

But just in case we move all the old reference data import jobs to a new 'reference' process which will only ever be manually triggered from here on.
These have been migrated to the new processes.
We'll need to think of an alternate approach. But till then we can simply remove them.
CRM and WATER came out of licence-import, and both depend on functions in DateHelpers.
However, need to switch to Promise.allSettled() and log errors.
Using `Promise.allSettled` we can deal with individual licences that fail to import and not break the whole import.

Having left the import to run (33 minutes) we dug into the errors we were receiving. One was failing to convert a date. Another was a licence with no non-draft versions. The rest were from not filtering 'draft' status licence versions from the queries for purposes and conditions.
By getting the data we need for the licences when fetching the licences to import in `import.js`, we save approximately 73K calls to the DB.

In our tests over 5000 licences this saved 5 secs. Not massive, but a noticeable improvement so worth doing.
This is used to report on the status of the pg-boss 'jobs'. As removed all of them this endpoint and the summary it produces is redundant.
We don't use the information in the table anymore.
Not even sure why they were here rather than in the old `nald-import` in the first place!
These dependencies are now redundant as we no longer rely on their functionality.
This was the last pg-boss job we needed to migrate.
We now update the nightly import process to daisy chain the processes, avoiding running later process if earlier ones have failed. We also collate the results into an object from which we generate a message we can include in the tracker email, as we no longer have pg-boss job stats.
It was just confirming the token used matched the one read in from the .env file. If you have access to the server you have access to the file so auth was a joke.

Removing the pointless auth will make triggering job easier from Jenkins.
It's the one that takes the longest and is not as critical as most of the others.
We generate the email content based on the order of the items in the object.
Cruikshanks added a commit that referenced this pull request Sep 17, 2024
https://eaflood.atlassian.net/browse/WATER-4670

We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535).

We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order.

But the latest version that shipped is failing to complete. We believe the issues are:

- Each 'job' is taking longer to complete than expected
- Based on current timings, the 'jobs' are overlapping
- Our attempts to connect the new ReSP import with the existing one are adding to the burden
- Based on investigating the logs, we think downstream jobs are no longer being triggered

We're [working on simplifying the project](#1023) and removing those elements we think decrease performance, for example, the use of [pg-boss](https://github.com/timgit/pg-boss).

But in the meantime we need to get the import back up and running. This change removes the extra jobs we put in for connecting to the [water-abstraction-system](https://github.com/DEFRA/water-abstraction-system) till we can come up with a better alternative. We also update the default schedule to allow each part of the import more time to complete to avoid overlapping.
Cruikshanks added a commit that referenced this pull request Sep 17, 2024
https://eaflood.atlassian.net/browse/WATER-4670

We've been making changes to the import service to support our work migrating return version management to WRLS, and replacing the current licence import with [one ready for ReSP](https://eaflood.atlassian.net/browse/WATER-4535).

We've also attempted to clarify what is actually happening during the various 'jobs' and to arrange them in our environments in a more logical order.

But the latest version that shipped is failing to complete. We believe the issues are:

- Each 'job' is taking longer to complete than expected
- Based on current timings, the 'jobs' are overlapping
- Our attempts to connect the new ReSP import with the existing one are adding to the burden
- Based on investigating the logs, we think downstream jobs are no longer being triggered

We're [working on simplifying the project](#1023) and removing elements that we think decrease performance, such as the use of [pg-boss](https://github.com/timgit/pg-boss).

But in the meantime, we need to get the import back up and running. This change removes the extra jobs we put in for connecting to the [water-abstraction-system](https://github.com/DEFRA/water-abstraction-system) until we can develop a better alternative. We also update the default schedule to allow each part of the import more time to complete to avoid overlapping.
Cruikshanks added a commit to DEFRA/water-abstraction-system that referenced this pull request Sep 26, 2024
https://eaflood.atlassian.net/browse/WATER-4670

After making some changes to [water-abstraction-import](https://github.com/DEFRA/water-abstraction-import) we found it was no longer completing when run. The issue seemed to be

- a step we introduced to let [water-abstraction-system](https://github.com/DEFRA/water-abstraction) see importing licences was overloading the environment because it was acting like a DOS on the system app
- each job was taking much longer than realised to complete, and the existing job schedule was overlapping them, again leading to limited system resources

We've always suspected that the features the previous team added to improve performance (pg-boss message queues and caching to Redis) were actually impacting it.

Initially, we planned to go through all the 'jobs' involved in importing data from NALD and strip out this overhead. We [carried out a spike](DEFRA/water-abstraction-import#1023), and our timings are much, _much_ better, proving our suspicions.

However, this was a live issue, so we removed the step and tweaked the times to get the import working again.

In the meantime, we'll either complete the spike or [WATER-4535, a replacement for the licence import](https://eaflood.atlassian.net/browse/WATER-4535). Either way, the block of job stats we display on the `/health/info` page will be redundant. In fact, we don't use or look at it anyway, even when there is an issue.

So, this removes all logic related to job information from the `/health/info` page in readiness.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge Used for spikes and experiments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant