Oracle-ES consistency #240
Conversation
The first step in ensuring consistency between Oracle and ES is to obtain a very basic set of task data - id and timestamp - from Oracle. Add a query for doing so.
es.get() raises NotFoundError in both cases - when the index does not exist and when the document does not exist. Also, it's more reasonable to check the index once, since it's the same for all messages.
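For illustration, a minimal sketch of such a one-time check (the host and index name here are hypothetical, not the actual configuration):

import sys
import elasticsearch

es = elasticsearch.Elasticsearch('http://localhost:9200')  # hypothetical host
INDEX = 'prodsys_rucio_ami'  # hypothetical index name

# Check the index once, up front, instead of relying on NotFoundError
# from every es.get() call.
if not es.indices.exists(INDEX):
    sys.stderr.write("(ERROR) Index '%s' does not exist.\n" % INDEX)
    sys.exit(1)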
- Add/update functions and their parameters' descriptions.
- Update the script's description.
- Add the consistency check's description to README.
Check that all fields supplied in the input data are present in ES and that their values match the input data, instead of working only with tasks and their timestamps. This allows checking tasks' other fields as well as other types of documents, such as datasets. Add stage 016 into the consistency chain, because it adds the fields required for getting documents of a given type from ES.
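A hedged sketch of that field-by-field comparison (the function and variable names are illustrative, not the actual stage code):

def fields_match(es_doc, input_data):
    """Return True if every input field is present in the ES document
    and its value matches the input data."""
    for field, value in input_data.items():
        if field not in es_doc or es_doc[field] != value:
            return False
    return True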
Force-pushed from 57159c1 to b8a2ab1.
Prepare the script for further development, where inconsistent tasks will be automatically reloaded into ES.
I think data samples should also be added for this new pipeline: the output of 009 with the new query, of 016 with that new input, etc.
Force-pushed from 4154b73 to b305a69.
These functions are either used by several scripts or will be in the future. Move them to the library to uphold the DRY principle.
Force-pushed from b305a69 to 8a71791.
DEBUG mode in data4es-start exists to check the workflow without uploading anything to ES. The consistency check writes nothing, so DEBUG is unnecessary here. Do not redirect the stages' stderr streams; leave them as-is.
While the script is stage 069's counterpart in data4es-consistency-check, they share no functionality.
Force-pushed from 27e6f95 to 7bac202.
- State what is retrieved by the query.
- Remove unnecessary information.
- Show an error message and exit if no host, port, or index is specified.
- Remove the parameters' default values.
Printing all discovered inconsistent records to stdout as a single batch conflicts with various things, such as pyDKB's file mode and the possibility of controlling the workflow with Apache Kafka. Create an output message with _id and _type for each inconsistent record. Still exit with code 1 if at least one inconsistent record was found, and 0 otherwise.
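An illustrative sketch of that output contract, reusing the hypothetical fields_match() helper sketched earlier (input_messages and the message structure are assumptions):

import json
import sys

found_inconsistent = False
for msg in input_messages:  # hypothetical: one input message per checked record
    doc = es.get(index=INDEX, doc_type=msg['_type'], id=msg['_id'])
    if not fields_match(doc['_source'], msg['data']):
        found_inconsistent = True
        # One output message per inconsistent record, not a single batch:
        print(json.dumps({'_id': msg['_id'], '_type': msg['_type']}))

sys.exit(1 if found_inconsistent else 0)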
Force-pushed from aa7e153 to cacba11.
#199 is almost completed; will wait for it to merge.
@@ -153,7 +153,11 @@ def main(args):
    cfg = load_config(stage.ARGS.conf)
    stage.process = process
    es_connect(cfg)
    stage.run()
    if not es.indices.exists(INDEX):
Just for information: this check would cause an AuthorizationException in case of a remote connection via an Nginx proxy:
>>> es = elasticsearch.Elasticsearch('http://login:[email protected]:9200')
>>> es.indices.exists('test_prodsys_rucio_ami')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 213, in exists
params=params)
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 318, in perform_request
status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 185, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.AuthorizationException: TransportError(403, u'')
I updated the proxy configuration to allow HEAD requests (to readable locations), as there's no actual need to block them; so now there should be no problem with it.
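For reference, the failing call could also have been guarded on the client side — a sketch only, not part of the actual change, and whether silently skipping the check is acceptable depends on the deployment:

from elasticsearch.exceptions import AuthorizationException

try:
    index_ok = es.indices.exists(INDEX)
except AuthorizationException:
    # The proxy rejects HEAD requests (403): we cannot verify the index,
    # so proceed and let later calls fail if it is really missing.
    index_ok = True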
And (again, just for information), there was another way to achieve the goal:
Check for index' existence before working.
es.get() raises NotFoundError in both cases - when index does not exist and
when document does not exist. Also, it's more reasonable to check index once
since it's the same for all messages.
Even with NotFoundError it is possible to tell one situation from another:
>>> try: es.get(index='test_prodsys_rucio_ami', doc_type='task', id=1468216)
... except Exception, err: pass
...
>>> err.info
{u'found': False, u'_type': u'task', u'_id': u'1468216', u'_index': u'test_prodsys_rucio_ami'}
>>>
>>> try: es.get(index='_no_such_index_', doc_type='task', id=14682166)
... except Exception, err: pass
>>> err.info
{u'status': 404, u'error': {u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'root_cause': [{u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'resource.id': u'tprodsys_rucio_ami', u'reason': u'no such index', u'type': u'index_not_found_exception'}], u'reason': u'no such index', u'type': u'index_not_found_exception', u'resource.id': u'tprodsys_rucio_ami'}}
>>> err.info['error']['reason']
u'no such index'
And in case of that error -- or maybe even in case of any error where info['error'] is defined (though I am not sure there can possibly be any other) -- re-raise the exception to indicate that the process cannot be continued, maybe wrapping it into DataflowException.
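A sketch of how that distinction and re-raising could look (DataflowException is the project's exception class; its exact usage here is an assumption, as are the function name and return convention):

from elasticsearch.exceptions import NotFoundError

def get_document(es, index, doc_type, doc_id):
    try:
        return es.get(index=index, doc_type=doc_type, id=doc_id)
    except NotFoundError as err:
        if isinstance(err.info, dict) and 'error' in err.info:
            # Index-level failure ('no such index'): the process
            # cannot be continued.
            raise DataflowException(str(err))
        # Document-level miss ({'found': False, ...}): report the
        # record as missing/inconsistent.
        return None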
The only situation where it makes any difference is when the index is removed or the access policy is changed during process execution: the check was successfully passed at the start, so any NotFoundError after it will be taken as "record missing", no matter what. But I don't think that is likely to happen, so there's nothing wrong with the one-time check.
In general everything looks fine.
But I didn't add any comments to the last commit (808280c): they will appear later or go to e-mail (I need to think a little more about it).
The only thing that should be done in this PR before it can be merged is the handling of "service" JSON fields. Specifically, two points (see the sketch after this list):
- "service" fields should be filtered out from the "data" fields before checking consistency (as "service" fields most likely are not written to ES as document content fields);
- in case of a parent/child relationship in ES, a child document cannot be fetched by ID with get(): it also requires the parent ID to be specified.
The last point is where I feel I might owe you an apology: if the search() method were used (as you meant to do initially), it would work the same way for both tasks and output_datasets. Another point is that an ids query is definitely more suitable for batch-mode processing (which, I hope, will sooner or later be added to the processing mode).
Due to this I see at least three ways to go here:
- leave things 'as-is' (adding a note to README that for output_datasets it most likely won't work properly). It is OK, as right now we only check task consistency -- but it may be a bit frustrating;
- add _parent handling to the current version. It looks like the most straightforward and cheap way to make things work as they were meant to;
- rework things once again to use the ids query. It feels like the most time-consuming way, and may be excessive (see below why).
If you take a look at PR #244 -- I'm working on some kind of a "common" wrapper around ES with commonly used functions, including "get record [fields] by id[+type][+parent]", etc. When I finish, I will most probably try to adapt your code to use it -- to check that the library module covers all (known) use cases, as well as to make sure that things one of us has already stumbled over will not appear again in other ES-using scenarios (and will be handled the same way everywhere). And as I thought about it here -- the "ids" version of get() will be added there as well; so switching back to search() only in view of the batch-processing plans may be a premature optimization.
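For illustration, such a wrapper function might look roughly like this (the actual interface is being developed in PR #244, so every name and parameter here is hypothetical):

def get_fields(es, index, doc_id, doc_type='task', fields=None, parent=None):
    """Get selected fields of a record by id [+type] [+parent]."""
    kwargs = {'index': index, 'doc_type': doc_type, 'id': doc_id}
    if fields is not None:
        kwargs['_source'] = fields  # restrict the returned source fields
    if parent is not None:
        kwargs['parent'] = parent   # required for child documents
    return es.get(**kwargs)['_source']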
But the _parent-aware version and/or leaving things as-is may still look better to you (or maybe you see a fourth or fifth way to go) -- so it is totally up to you what to choose.
@Evildoor, another thing I have just noticed after a regular consistency check: the scenario seems not to handle…
Checked - this seems to happen as follows:
Considering the possible situations:
I believe that this issue should be fixed by adding key…
Force-pushed from f059a61 to 7117242.
Moved 808280c into a separate branch per email discussion, so that we can finish and merge the existing commits without it hindering us (as it's a somewhat different problem that could require extensive work to deal with): https://github.com/PanDAWMS/dkb/commits/oracle-es-consistency-limit
@Evildoor wrote:
Right, thank you for checking. That's what I thought was happening, too.
In fact, there is a similar situation with dkb/Utils/Dataflow/071_esConsistency/consistency.py (lines 154-156 in 7117242).
I think these two cases (…)
Yet there's nothing wrong in asking Stage 016 not to produce extra fields, as you suggest. Except that it would be better to take care of both…
What is below is definitely out of this PR's scope, yet worth thinking about. This situation itself says that something went wrong in general: stage 016 performs not an "atomic" data transformation, and that hinders us from reusing it 'as is'. It means that the "good solution" here would be to split Stage 016 into two smaller, more "atomic" stages: one that takes care of ES indexing information, another that transforms/extends/checks task metadata. And maybe the first one should work with any type of input message, not only those with task metadata -- to move this functionality out of 091 as well, making it more "atomic" too. Again, this shouldn't be done in this PR; and in view of the coming ES schema reworking (that will heavily affect the…)
Good idea, let's do that. Meanwhile, I've implemented this suggestion for here-and-now checking of just…
Force-pushed from 9bee9f7 to b8d412f.
Good to hear this; we definitely need a common library for ES-related work.
This PR is approved for further merging.
Yet there's a couple of insignificant comments, so before I merge things, please let me know when you have taken a look at them and decided whether or not to change anything.
Force-pushed from b8d412f to ee65922.
@mgolosova, please review again.
Force-pushed from ee65922 to 93e650c.
The type of _id is unknown - it can be str or int for a task, and str for a dataset.
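Hence comparisons are safest on stringified values. A one-line illustration (mark_inconsistent() and the variable names are hypothetical, not the actual code):

# _id may be str or int, so normalize both sides before comparing:
if str(es_doc['_id']) != str(record['_id']):
    mark_inconsistent(record)  # hypothetical handler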
The field is required to get child documents such as output datasets.
Service fields are different from data fields and shouldn't be checked.
These are unnecessary because library files are not supposed to be executed.
Force-pushed from 93e650c to 8f86ddd.
Add a mechanism for checking the consistency of data between Oracle and Elasticsearch.
Currently, consistency is only checked for tasks (not datasets) by comparing the timestamps.
The consistency check is launched by executing Utils/Dataflow/run/data4es-consistency-check.
https://trello.com/c/QjYjIFi6