Oracle-ES consistency #240

Evildoor · 2019-04-03T10:33:42Z

Add a mechanism for checking the consistency of data between Oracle and elasticsearch.

Currently the consistency is only checked for tasks (not datasets) by comparing the timestamps.

Consistency check is launched by executing Utils/Dataflow/run/data4es-consistency-check.

The first step in ensuring consistency between Oracle and ES is to obtain a very basic set of task data - id and timestamp - from Oracle. Add a query for doing so.

es.get() raises NotFoundError in both cases - when index does not exist and when document does not exist. Also, it's more reasonable to check index once since it's the same for all messages.

- Add/update functions and their parameters' descriptions.
- Update the script's description.
- Add the consistency check's description to the README.

Check that all fields supplied in input data are present in ES and their values are matching the input data, instead of working only with tasks and their timestamps. This will allow checking tasks' other fields as well as different types of documents such as datasets. Add stage 016 into consistency chain because it adds the fields required for getting documents of given type from ES.
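The field-by-field check described above can be sketched roughly as follows. This is only an illustration, not the code from consistency.py: the function name `find_mismatches` and the `MISSING` sentinel are hypothetical, and the sample values are taken from the warning message quoted later in this conversation.

```python
# Hypothetical sentinel marking a field that is absent from the ES document.
MISSING = object()


def find_mismatches(expected, actual):
    """Compare input data with an ES document, field by field.

    Only the fields present in `expected` (the input data) are checked;
    extra fields in `actual` (the ES document) are ignored.
    Returns {field: (expected_value, actual_value)} for differing fields.
    """
    mismatches = {}
    for field, value in expected.items():
        found = actual.get(field, MISSING)
        if found != value:
            mismatches[field] = (value, found)
    return mismatches


oracle = {'chain_data': [18024963], 'chain_id': 18024963,
          'taskid': 18024963, 'task_timestamp': '14-05-2019 02:11:40'}
es_doc = {'task_timestamp': '14-05-2019 02:11:40',
          'chain_data': [17988291, 17988294, 18024963],
          'chain_id': 17988291, 'taskid': 18024963}
print(sorted(find_mismatches(oracle, es_doc)))
```

Comparing only the supplied fields is what lets the same check work for any document type, as long as the input carries the fields to be verified.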

Prepare the script for further development, where inconsistent tasks will be automatically reloaded into ES.

mgolosova

I think data samples should also be added for this new pipeline: the output of 009 with the new query, of 016 with that new input, etc.

Utils/Dataflow/009_oracleConnector/query/consistency.sql

Utils/Dataflow/069_upload2es/README

Utils/Dataflow/069_upload2es/consistency.py

Utils/Dataflow/run/data4es-consistency-check

Utils/Dataflow/shell_lib/get_config

Utils/Dataflow/069_upload2es/consistency.py

These functions are either used by several scripts or will be in the future. Move them to library to uphold DRY principle.

DEBUG mode in data4es-start exists to check the workflow without uploading anything to ES. Consistency check writes nothing, so DEBUG is unnecessary here. Do not redirect the stages' stderrs, leave them as-is.

While the script is the stage 069's counterpart in data4es-consistency-check, they share no functionality.

- State what is retrieved by the query.
- Remove unnecessary information.

- Show an error message and exit if no host, port, or index is specified.
- Remove default values of the parameters.
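A minimal sketch of such a parameter check, assuming the configuration has already been loaded into a dict. The names `REQUIRED_PARAMS`, `missing_es_params` and `check_es_config`, the key names, and the exact wording of the error message are all hypothetical, not taken from the actual script.

```python
import sys

# Assumed key names; the real config may name these differently.
REQUIRED_PARAMS = ('host', 'port', 'index')


def missing_es_params(cfg):
    """Return the required ES parameters that are absent or empty."""
    return [name for name in REQUIRED_PARAMS if not cfg.get(name)]


def check_es_config(cfg):
    """Exit with an error message if a required parameter is missing."""
    missing = missing_es_params(cfg)
    if missing:
        sys.stderr.write('(ERROR) ES parameters not specified: %s\n'
                         % ', '.join(missing))
        sys.exit(1)
```

With no defaults to fall back on, a misconfigured run stops immediately instead of silently checking the wrong host or index.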

Printing all discovered inconsistent records to stdout as a batch conflicts with several things, such as pyDKB's file mode and the possibility of controlling the workflow with Apache Kafka. Create an output message with _id and _type for each inconsistent record instead. Still exit with code 1 if at least one inconsistent record was found, and with 0 otherwise.
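The per-record output and the exit code rule can be illustrated like this. It is a sketch under assumptions: plain JSON strings stand in for pyDKB output messages, and the helper names `inconsistency_messages` and `exit_code` are hypothetical.

```python
import json


def inconsistency_messages(records):
    """Yield one JSON message with _id and _type per inconsistent record."""
    for record in records:
        message = {'_id': record['_id'], '_type': record['_type']}
        yield json.dumps(message, sort_keys=True)


def exit_code(inconsistent_count):
    """Return 1 if at least one inconsistent record was found, else 0."""
    return 1 if inconsistent_count > 0 else 0


found = [{'_id': 18024963, '_type': 'task', 'taskid': 18024963}]
for line in inconsistency_messages(found):
    print(line)
```

Emitting one message per record keeps the stage usable both in file mode and in a streaming setup, since a downstream consumer never has to wait for the whole batch.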

Evildoor · 2019-04-18T13:47:27Z

#199 is almost complete; I will wait for it to be merged and then merge master here before adding samples, to avoid conflicts in the READMEs.

mgolosova · 2019-04-24T18:30:13Z

Consistency check of ES@DKB and ProdSys DB

mgolosova · 2019-05-17T10:37:13Z

Utils/Dataflow/069_upload2es/consistency.py

@@ -153,7 +153,11 @@ def main(args):
        cfg = load_config(stage.ARGS.conf)
        stage.process = process
        es_connect(cfg)
-        stage.run()
+        if not es.indices.exists(INDEX):


Just for information: this check would cause AuthorizationException in case of remote connection via Nginx proxy:

    >>> es = elasticsearch.Elasticsearch('http://login:[email protected]:9200')
    >>> es.indices.exists('test_prodsys_rucio_ami')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
        return func(*args, params=params, **kwargs)
      File "/usr/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 213, in exists
        params=params)
      File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 318, in perform_request
        status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
      File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 185, in perform_request
        self._raise_error(response.status, raw_data)
      File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
        raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
    elasticsearch.exceptions.AuthorizationException: TransportError(403, u'')

I updated the proxy configuration to allow HEAD requests (to readable locations), as there's no actual need to block them; so now there should be no problem with it.

And (again, just for information), there was another way to hit the goal:

Check for index' existence before working.
es.get() raises NotFoundError in both cases - when index does not exist and
when document does not exist. Also, it's more reasonable to check index once
since it's the same for all messages.

Even with NotFoundError it is possible to tell one situation from the other:

    >>> try: es.get(index='test_prodsys_rucio_ami', doc_type='task', id=1468216)
    ... except Exception, err: pass
    ...
    >>> err.info
    {u'found': False, u'_type': u'task', u'_id': u'1468216', u'_index': u'test_prodsys_rucio_ami'}
    >>>
    >>> try: es.get(index='_no_such_index_', doc_type='task', id=14682166)
    ... except Exception, err: pass
    >>> err.info
    {u'status': 404, u'error': {u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'root_cause': [{u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'resource.id': u'tprodsys_rucio_ami', u'reason': u'no such index', u'type': u'index_not_found_exception'}], u'reason': u'no such index', u'type': u'index_not_found_exception', u'resource.id': u'tprodsys_rucio_ami'}}
    >>> err.info['error']['reason']
    u'no such index'

And in case of the error -- or, maybe, even in case of any error, when info['error'] is defined (but I am not sure if there can possibly be any other error) -- re-raise the exception to indicate that the process can not be continued. Maybe wrapping the exception into DataflowException.

The only situation where it makes any difference is when the index is removed or the access policy is changed during process execution: the check passed successfully at the start, so any NotFoundError after that will be treated as "record missing", no matter what. But I don't think this is likely to happen, so there's nothing wrong with the one-time check.
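For reference, the alternative discussed above (telling the two NotFoundError cases apart by the error's info) could look roughly like this. It is a sketch only: the helper name `is_missing_index` is hypothetical, and the layout of the info dict is assumed to match the session output shown above, which may differ between ES versions.

```python
def is_missing_index(info):
    """Tell "no such index" from "no such document" by the error info.

    `info` is the dict found in the `info` attribute of the exception.
    A missing document gives {'found': False, ...}; a missing index
    gives an 'error' dict with type 'index_not_found_exception'.
    """
    error = info.get('error')
    if not isinstance(error, dict):
        return False
    return error.get('type') == 'index_not_found_exception'


no_document = {'found': False, '_type': 'task', '_id': '1468216',
               '_index': 'test_prodsys_rucio_ami'}
no_index = {'status': 404,
            'error': {'reason': 'no such index',
                      'type': 'index_not_found_exception'}}
print(is_missing_index(no_document), is_missing_index(no_index))
```

A caller would re-raise when the function returns True and treat the record as missing otherwise.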

mgolosova

In general everything looks fine.

But I didn't add any comments on the last commit (808280c): they will appear later or go to e-mail (I need to think a little more about it).

The only thing that should be done in this PR before it can be merged is the handling of "service" JSON fields. Specifically, two points:

- "service" fields should be filtered out from the "data" fields before checking the consistency (as "service" fields are most likely not written to ES as document content fields);
- in case of a parent/child relationship in ES, a child document cannot be retrieved by ID with get(): it also requires the parent ID to be specified.

The last point is where I feel I might owe you an apology: if the search() method had been used (as you initially intended), it would work the same way for both tasks and output_datasets. Another point is that an ids query is definitely more suitable for batch processing (which, I hope, will sooner or later be added as a processing mode).

Due to this I see here at least three ways to go:

- leave things as is (adding a note to the README that it most likely won't work properly for output_datasets). This is OK, as right now we only check task consistency -- but it may be a bit frustrating;
- add _parent handling to the current version. This looks like the most straightforward and cheapest way to make things work as they were meant to;
- rework things once again to use an ids query. This feels like the more time-consuming way, and may be excessive (see below why).

If you take a look at the PR #244 -- I'm working on some kind of a "common" wrapper around ES with commonly used functions, including the "get record [fields] by id[+type][+parent]", etc. When I finish I will most probably try to adapt your code to use it -- to check that the library module covers all (known) needed use cases, as well as to make sure that things that one of us has already stumbled over will not appear again in different ES-using scenarios (and will be handled the same way everywhere). And as I thought about it here -- the "ids" version of get() will be added there as well; so switching back to search() only in view of the batch processing plans may be a premature optimization.

But the _parent-aware version and/or leaving things as is may still look better to you (or maybe you see a fourth and fifth way to go) -- so it is totally up to you what to choose.

Utils/Dataflow/069_upload2es/consistency.py

Utils/Dataflow/shell_lib/eop_filter

mgolosova · 2019-05-20T08:01:50Z

@Evildoor, another thing I have just noticed after a regular consistency check: the scenario seems not to handle the chain_data and chain_id fields properly. The error messages I've got look like this:
(WARN) 2019-05-20T08:05:01.301149 Document (task, 18024963) differs between Oracle and ES: Oracle:{u'chain_data': [18024963], u'chain_id': 18024963, u'taskid': 18024963, u'task_timestamp': u'14-05-2019 02:11:40'} ES:{u'task_timestamp': u'14-05-2019 02:11:40', u'chain_data': [17988291, 17988294, 18024963], u'chain_id': 17988291, u'taskid': 18024963}

Evildoor · 2019-05-20T09:54:23Z

the scenario seems not to handle chain_data and chain_id fields properly

Checked - this seems to happen as follows:

- 009 produces data with taskid and timestamp.
- 016, among other things, checks data for chain_data, finds none, considers it to be missing and sets the task to be a root of its own chain.
- 071 sees incorrectly defined chain_data and chain_id, treats them as fields to check, and finds a discrepancy between them and correct values in ES.

Considering the possible situations:

- Normal data loading, where it is possible to encounter records with missing chain_data.
- Consistency check described above, where chain_data and chain_id shouldn't actually be checked.
- Consistency check which includes chain_data and/or chain_id that should be checked.

I believe that this issue should be fixed by adding a key --skip-empty-chains to stage 016 that will skip the processing of chain_data if it is missing. I will do so later if there are no objections.

Evildoor · 2019-05-20T14:00:01Z

Moved 808280c into a separate branch per email discussion, so that we can finish and merge the existing commits without it hindering us (as it's a somewhat different problem that can possibly require extensive work to be dealt with):

https://github.com/PanDAWMS/dkb/commits/oracle-es-consistency-limit

mgolosova · 2019-05-21T12:02:14Z

@Evildoor wrote:

@mgolosova wrote:

the scenario seems not to handle chain_data and chain_id fields properly

Checked - this seems to happen as follows:

009 produces data with taskid and timestamp.

016, among other things, checks data for chain_data, finds none, considers it to be missing and sets the task to be a root of its own chain.

071 sees incorrectly defined chain_data and chain_id, treats them as fields to check, and finds a discrepancy between them and correct values in ES.

Right, thank you for checking. That's what I thought was happening, too.

I believe that this issue should be fixed by adding a key --skip-empty-chains to stage 016 that will skip the processing of chain_data if it is missing. I will do so later if there are no objections.

In fact, there is a similar situation with phys_category:

dkb/Utils/Dataflow/071_esConsistency/consistency.py

Lines 154 to 156 in 7117242

    
        # Crutch. Remove unwanted (for now) field added by Stage 016.
        if 'phys_category' in data:
            del data['phys_category']

I think these two cases (chain_* and phys_category fields) should be treated similarly.
So handling chain_* fields in just the same way looks like the cheapest solution to me.

Yet there's nothing wrong with asking Stage 016 not to produce extra fields, as you suggest. Except that it would be better to take care of both chain_* and phys_category with this new option.

What is below is definitely out of this PR scope, yet worth thinking.

This situation itself shows that something went wrong in general: stage 016 does not perform an "atomic" data transformation, and this hinders us from reusing it 'as is'. It means that the "good solution" here would be to split Stage 016 into two smaller, more "atomic" stages: one that takes care of ES indexing information, another -- transforming/extending/checking task metadata. And maybe the first one should work with any type of input messages, not only those with task metadata -- to move this functionality out of 091 as well, making it more "atomic" too.

Again, this shouldn't be done in this PR; and in view of the coming ES schema rework (which will heavily affect the data4es process) it may not be worth doing at all. Or maybe it can become the first step towards these changes; we just need to think a bit longer and try to foresee what's coming.
Maybe we should discuss this question at the next meeting (on Fri)?

Evildoor · 2019-05-21T14:31:25Z

This situation itself shows that something went wrong in general: stage 016 does not perform an "atomic" data transformation, and this hinders us from reusing it 'as is'.
...
Maybe we should discuss this question at the next meeting (on Fri)?

Good idea, let's do that. Meanwhile, I've implemented this suggestion:

I think these two cases (chain_* and phys_category fields) should be treated similarly.
So handling chain_* fields in just the same way looks like the cheapest solution to me.

for here-and-now checking of just taskid and timestamp.
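In the spirit of the phys_category crutch quoted above, ignoring the fields added by stage 016 might look like the sketch below. The tuple `IGNORED_FIELDS` and the function `drop_ignored_fields` are hypothetical names for illustration, not necessarily how the commit does it.

```python
# Fields that stage 016 adds but that should not be compared (for now).
IGNORED_FIELDS = ('phys_category', 'chain_data', 'chain_id')


def drop_ignored_fields(data, ignored=IGNORED_FIELDS):
    """Return a copy of `data` without the fields that must not be checked."""
    return {key: value for key, value in data.items() if key not in ignored}


message = {'taskid': 18024963, 'task_timestamp': '14-05-2019 02:11:40',
           'chain_data': [18024963], 'chain_id': 18024963}
print(sorted(drop_ignored_fields(message)))
```

Keeping the ignored names in one place makes the crutch easy to remove once stage 016 stops producing the extra fields.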

Evildoor · 2019-05-22T11:25:01Z

_parent field is now passed to get(), other service fields (ones starting with underscore and not treated already) are removed from the data before ES request.
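Roughly, the idea is the following. This is a sketch with hypothetical helper names; the `parent` keyword of get() is assumed to be available in the client version in use, as in older elasticsearch-py releases, and should be verified against the installed one.

```python
def split_service_fields(message):
    """Split a message into service fields (leading underscore) and data."""
    service = {k: v for k, v in message.items() if k.startswith('_')}
    data = {k: v for k, v in message.items() if not k.startswith('_')}
    return service, data


def build_get_kwargs(index, service):
    """Build keyword arguments for es.get() from the service fields."""
    kwargs = {'index': index,
              'doc_type': service['_type'],
              'id': service['_id']}
    # Child documents (e.g. output datasets) also need the parent ID.
    if '_parent' in service:
        kwargs['parent'] = service['_parent']
    return kwargs


msg = {'_id': 'some.dataset.name', '_type': 'output_dataset',
       '_parent': 18024963, 'datasetname': 'some.dataset.name'}
service, data = split_service_fields(msg)
print(build_get_kwargs('test_prodsys_rucio_ami', service), data)
```

Only `data` is then compared with the document returned by ES, so service fields never show up as false discrepancies.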

If you take a look at the PR #244 -- I'm working on some kind of a "common" wrapper around ES with commonly used functions, including the "get record [fields] by id[+type][+parent]", etc. When I finish I will most probably try to adapt your code to use it -- to check that the library module covers all (known) needed use cases, as well as to make sure that things that one of us has already stumbled over will not appear again in different ES-using scenarios (and will be handled the same way everywhere). And as I thought about it here -- the "ids" version of get() will be added there as well; so switching back to search() only in view of the batch processing plans may be a premature optimization.

Good to hear this, we definitely need a common library for ES-related work.

Utils/Dataflow/071_esConsistency/consistency.py

mgolosova

This PR is approved for further merging.
Yet there are a couple of minor comments, so before I merge things please let me know when you have taken a look at them and decided whether to change anything.

Utils/Dataflow/071_esConsistency/consistency.py

Evildoor · 2019-05-28T10:28:56Z

@mgolosova, please, review again.

Evildoor added 6 commits April 3, 2019 10:44

Add consistency query to stage 009.

9903938

The first step in ensuring consistency between Oracle and ES is to obtain a very basic set of task data - id and timestamp - from Oracle. Add a query for doing so.

Add script for consistency check to stage 069.

42d55b3

Get index name from config rather than code.

f27f278

Check for index' existence before working.

af9f212

es.get() raises NotFoundError in both cases - when index does not exist and when document does not exist. Also, it's more reasonable to check index once since it's the same for all messages.

Update documentation.

4c560a6

- Add/update functions and their parameters' descriptions.
- Update the script's description.
- Add the consistency check's description to the README.

Add a very basic consistency check script.

2296f6d

Evildoor self-assigned this Apr 3, 2019

Evildoor force-pushed the oracle-es-consistency branch from 57159c1 to b8a2ab1 Compare April 5, 2019 09:14

Evildoor added 2 commits April 5, 2019 12:08

Save and display the info about different tasks.

acfe45a

Prepare the script for further development, where inconsistent tasks will be automatically reloaded into ES.

Merge remote-tracking branch 'origin/master' into oracle-es-consistency

e39efe2

Evildoor changed the title ~~[WIP] Oracle-ES consistency~~ Oracle-ES consistency Apr 5, 2019

Evildoor requested a review from mgolosova April 5, 2019 13:06

mgolosova reviewed Apr 11, 2019

Evildoor commented Apr 17, 2019

Utils/Dataflow/069_upload2es/consistency.py (outdated)

Evildoor force-pushed the oracle-es-consistency branch from 4154b73 to b305a69 Compare April 17, 2019 12:24

Move certain shell functions to library.

8a71791

These functions are either used by several scripts or will be in the future. Move them to library to uphold DRY principle.

Evildoor force-pushed the oracle-es-consistency branch from b305a69 to 8a71791 Compare April 17, 2019 13:18

Evildoor added 2 commits April 17, 2019 15:24

Remove DEBUG mode.

90380a9

DEBUG mode in data4es-start exists to check the workflow without uploading anything to ES. Consistency check writes nothing, so DEBUG is unnecessary here. Do not redirect the stages' stderrs, leave them as-is.

Move ES consistency script into a separate stage.

7bac202

While the script is the stage 069's counterpart in data4es-consistency-check, they share no functionality.

Evildoor force-pushed the oracle-es-consistency branch from 27e6f95 to 7bac202 Compare April 17, 2019 14:25

Evildoor added 7 commits April 18, 2019 11:55

Update a query description.

12dd86e

- State what is retrieved by the query.
- Remove unnecessary information.

Update and explain a magic number.

944b5a2

Reword es_connect() description.

26a1dfe

Change log prefixes to standard ones.

20875b1

Fix pop() results handling.

46cf0af

Update ES parameters handling.

72d85a9

- Show an error message and exit if no host, port, or index is specified.
- Remove default values of the parameters.

Evildoor force-pushed the oracle-es-consistency branch from aa7e153 to cacba11 Compare April 18, 2019 13:41

mgolosova mentioned this pull request Apr 26, 2019

[obsolete] pyDKB storages #244

Open

mgolosova reviewed May 17, 2019

Utils/Dataflow/069_upload2es/consistency.py (outdated)

Utils/Dataflow/069_upload2es/consistency.py (outdated)

Utils/Dataflow/shell_lib/eop_filter (outdated)

Evildoor force-pushed the oracle-es-consistency branch from f059a61 to 7117242 Compare May 20, 2019 13:54

Ignore two additional fields.

165c5d2

Evildoor force-pushed the oracle-es-consistency branch from 9bee9f7 to b8d412f Compare May 22, 2019 11:18

mgolosova reviewed May 27, 2019

Utils/Dataflow/071_esConsistency/consistency.py (outdated)

mgolosova reviewed May 27, 2019

Utils/Dataflow/071_esConsistency/consistency.py

mgolosova previously approved these changes May 27, 2019

mgolosova reviewed May 27, 2019

Utils/Dataflow/071_esConsistency/consistency.py (outdated)

Evildoor dismissed mgolosova’s stale review via ee65922 May 28, 2019 10:12

Evildoor force-pushed the oracle-es-consistency branch from b8d412f to ee65922 Compare May 28, 2019 10:12

Evildoor force-pushed the oracle-es-consistency branch from ee65922 to 93e650c Compare May 28, 2019 10:32

Evildoor added 5 commits May 28, 2019 12:35

Change messages formatting.

8f84d22

Type of _id is unknown - it can be str or int for task, and str for dataset.

Add _parent field handling.

d195650

The field is required to get child documents such as output datasets.

Remove service fields before checking.

612bf52

Service fields are different from data fields and shouldn't be checked.

Remove interpreter directives from lib files.

01ae258

These are unnecessary because library files are not supposed to be executed.

Simplify a field retrieval.

8f86ddd

Evildoor force-pushed the oracle-es-consistency branch from 93e650c to 8f86ddd Compare May 28, 2019 10:36

mgolosova approved these changes May 28, 2019

mgolosova merged commit 51accc9 into master May 28, 2019

mgolosova deleted the oracle-es-consistency branch May 28, 2019 13:22
