add product conversion tests #53

renancunha · 2020-08-18T23:26:47Z

Similar to #52, this PR should not be merged into master. It was created to try to come closer to a real-world scenario, where we use the shublang to do data conversion between schemas.

Initially, a specification (a set of rules) was created to transform the data contained at the file tests/resources/unified_schema_product.json into a custom product schema that' is used at BV. We tried to cover all the fields present at the custom schema, according to the availability of data at the source schema.

Below are some considerations:

We should start to think about a functionality to solve Convert Free Form Text to Enum #26. It's very common to have to map a free form text to a set of predefined values (in Enum). In this specific example, we have to map the currency symbols (like £) to the currency abbreviations (like GBP), also had to map the stock situation (InStock, OutOfStock, etc). Do this kind of thing manually, using map, could get really weird, not-readable, and hard to maintain.
We need a pipe to convert data to string. There are some fields that are integers at the unified-schema product but should be strings at the customer schema, like all the fields related to reviews.
When we do some operation of filtering or something that results in a list with jmespath, we need to use the first twice to get exactly the data that we want. As an example, when we try to get a determined property from the additionalProperties list, we can do that in two ways:
- jmespath("additionalProperty[?name=='colour'].value | [0]") | first or
- jmespath("additionalProperty[?name=='colour'].value") | first | first
In these cases, we will need some kind of convention on which approach we should adopt.

It would be great to have discussions about the expressions created, possible optimizations, different approaches to get the same results, expressions complexity, etc.

akshayphilar · 2020-08-19T02:45:59Z

@ivankivanov could you please have a look at the missing fields and see if these can be mapped back to our schema
https://github.com/scrapinghub/shublang/pull/53/files#diff-ce8c62e2dd02132083d2a2d1a43490a8R48-R77

BurnzZ

Hi @renancunha , you're off to a great start here! Thanks for putting this out. It really helps visualize and see where we're currently at the moment in terms of conversions. 🎉

In addition to the three (3) main points you've raised, I have the following food for thought (that might also address 2 and 3):

So right off the bat we could actually see that:

all data conversions starts with jmespath
almost all data conversion ends with first
there was a need to transform a number into a string

I was thinking that we could avoid the superfluous and unnecessary operations above and have those type-conversions completely implicit. One way to achieve this is by defining the schema of the item transformation we want.

Such schema would contain:

the data type
the jmespath of the data source which complies with the Unified Schema

As a quick example, suppose we only need to retrieve a small subset of fields from the unified_schema_product.json you've used:

{
	"price_now": {
		"type": "str",
		"src": "offers[0].price"
	},

	# No need to do a `map(lambda x: str(x))`
	"rating_current": {
		"type": "str"
		"src": "aggregateRating.ratingValue",
	},

	"reviews_quantity_1_star": {
		"type": "str",

		# This actually would return: [4]
		# but we don't need to
		# - pipe jmespath with "... | [0]",
		# - do a `map(lambda x: str(x))`,
		# - and end it with a `first`
		# because the conversion module we'll be create
		# would handle it based on the type

		"src": "ratingHistogram[?ratingValue=='1'].ratingPercentage"
	},

	"image_urls": {
		"type": "List[str]",

		# Similar to PEP-585, we can specify the expected elements
		# of any iterables. The conversion module would be able to 
		# adapt to it automatically.

		"src": "images"
	}
}

Notice that we didn't need to write jmespath and first in any of them. :D

The doesn't actually solve the 1st concern you've raised which tackles #26 . I'll need to brew that one on my brain for a few days. 🤔

In any case, the schema we'll be defining would be used as input, alongside the data source, as you can see below:

Lemme know what you think. Cheers!

renancunha · 2020-08-24T14:39:28Z

Thanks for adding such great analyzes on our discussion here @BurnzZ! I agree with the points that you have mentioned. Looking at the "specifications" created to do this transformation above we see this repetitive pattern of getting the source field with jmespath and ending selecting the first element with the first function.

Some considerations:

About the need to transform number to string: I'll open a PR adding this functionality soon.
About defining the schema of the item transformation we want: I think that it makes total sense and this new schema definition will act as a specification file to the output that we want. There are cases where we need to also apply some rules (shublang functions) to transform the data. Considering these cases, maybe should we also specify at the new schema definition, for each field (if needed), a rules pipeline that our conversion module should execute at the src data? As an example, let's get the category field, which should be mapped from the breadcrumbs[*].name and then joined with join(" > "):

"category": {
    "type": "str",
    "src": "breadcrumbs[*].name",
    "rules": "join(' > ')"
}

By doing this will have some kind of separation of concerns: the src, as it's names suggests, will be responsible just to fetch some information from the data source and the rules will be responsible just to transformations. I don't know if naming it as rules is the best choice, it was just an idea.

I think having this three info (type, src, and rules) for each field, our module will be capable of handling every single case that came to my mind now.

BurnzZ · 2020-08-25T11:59:50Z

The third part would be a great addition @renancunha ! You're absolutely right about the separation of concerns.

I don't know if naming it as rules is the best choice, it was just an idea.

What do you think about calling it transformations? since to rule needs to actually enforce something, but in this case, we're transforming something.

So far, we have established a clear split among:

type - enforces the data type of the value, be it int, float, List[int], List[str], dict, etc.
src - where the field comes from with respect to the Unified Schema.
transformations - could be optional if we need any further modifications to the current data format or value. But I think it's important to note that the type could handle the most basic of conversions, like if we have a value of 299.00 and it's specified as "type": "str", then it's an automatic conversion into "299.00", without the need for transformations. Another example would be if 299.00 is specified as "type": "int", then it would be converted automatically into 299.

I'm trying to think of anything else we might add, but I think these are the fundamental ones. Anything else might just be special or edge cases for some specific use cases.

Do you think the next step would be for you to try out implementing a basic PoC for this "specifications" we've been discussing? 😄

renancunha · 2020-08-25T12:43:47Z

What do you think about calling it transformations? since to rule needs to actually enforce something, but in this case, we're transforming something.

Naming it as transformations makes much more sense, people will look at and immediately knows what is this, great catch @BurnzZ. I think we can proceed with this choice.

Do you think the next step would be for you to try out implementing a basic PoC for this "specifications" we've been discussing?

Yeah, I think that I can start to work on a PoC for creating this module responsible to parse a schema file (as a tree of fields), field by field, and executing the steps that we've discussed here (get the src, apply the transformations, resolve the result according to the type, etc). I will just give attention to #26 firstly because it will be necessary even in the simple schema conversions.

My question is where should I start to implement this PoC? It would be great to implement it in a repository where we can follow the progress of it (like we do here).

add complex use-cases of shublang

0b88c2f

renancunha requested review from BurnzZ and akshayphilar August 18, 2020 23:27

akshayphilar requested a review from ivankivanov August 19, 2020 02:41

BurnzZ reviewed Aug 20, 2020

View reviewed changes

renancunha mentioned this pull request Aug 24, 2020

add new 'str' functionality #54

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add product conversion tests #53

add product conversion tests #53

renancunha commented Aug 18, 2020

akshayphilar commented Aug 19, 2020

BurnzZ left a comment •

edited

Loading

renancunha commented Aug 24, 2020

BurnzZ commented Aug 25, 2020

renancunha commented Aug 25, 2020

add product conversion tests #53

Are you sure you want to change the base?

add product conversion tests #53

Conversation

renancunha commented Aug 18, 2020

akshayphilar commented Aug 19, 2020

BurnzZ left a comment • edited Loading

Choose a reason for hiding this comment

renancunha commented Aug 24, 2020

BurnzZ commented Aug 25, 2020

renancunha commented Aug 25, 2020

BurnzZ left a comment •

edited

Loading