Parsing Classes (sequence and values) #13
Comments
Is this related to #8? |
Not related. (In reply to Gabe Fierro's message of Thu, Jun 27, 2013 at 1:59 PM.) |
Looking in the XML for ipg120103, I can see that the main-classification tag that indicates the US class is the following
According to the current USPTO XML schema 4.2, the first 3 characters are the class and the remaining characters are the subclass. This gives us class = 52 and subclass = 7168. Per the same schema, the first 3 digits of the subclass sit to the left of the decimal point, giving us subclass = 716.8, so it's definitely possible to parse out 52/716.8. The other classes, I believe, come from the US classifications of the cited patents. Should we extract the classes in this way? |
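A minimal sketch of the rule described above, not the project's actual parser: it takes a class string and a subclass string and places the decimal point after the third character of the subclass. The function name `format_us_class` is hypothetical, and the sketch assumes the subclass string still carries any leading padding (so that the first 3 characters really are the whole-number part). It does not attempt the other cases asked about later in the thread.

```python
def format_us_class(class_code, subclass_code):
    """Render a (class, subclass) pair as 'class/subclass'.

    Assumption: the first 3 characters of subclass_code are the part to the
    left of the decimal point, and anything after them is the fractional part.
    """
    cls = class_code.strip()
    # Slice before stripping so padded subclasses keep their positions.
    whole = subclass_code[:3].strip()
    fraction = subclass_code[3:].strip()
    sub = whole + "." + fraction if fraction else whole
    return cls + "/" + sub


print(format_us_class("52", "7168"))  # 52/716.8
print(format_us_class("52", "463"))   # 52/463
```

If the padding has already been stripped upstream, a shorter subclass becomes ambiguous under this rule, which is one reason to keep the spaces until after the split.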
Gotcha. So if we see more than 3 digits, we can add a period. What happens in other cases? |
From the documentation
|
Right, this stuff is so confusing. It's like creating structure within a [...] (In reply to Gabe Fierro's message of Mon, Jul 1, 2013 at 3:06 PM.) |
BNF -> PEG (if possible) -> Test drive. |
I don't think it's really that complicated; we just have to decide how we want to transform the strings. The basic form is I think if we break up the class strings as:
and don't strip the spaces, we should be fine |
Hey Gabe. What does this data look like in DVN? To whatever extent we might [...] (In reply to Gabe Fierro's message of Tue, Jul 2, 2013 at 9:50 AM.) |
From what I can see, all the rows in /data/patentdata/DVNFIXED/class.sqlite3 look like
So the only gap is that the current code doesn't handle the subclass decimals. If we handle that, we should be fine in terms of backwards compatibility. |
Does the parsing of the classes strip away the "." and other punctuation? When I compare patent #8087209 (ipg120103) with the USPTO equivalent, I see differences.
The parser returns [[u'52', u'7168'], [u'52', u'7161'], [u'52', u'463'], [u'52', u'464'], [u'52', u'2881']]
On the USPTO site, I see 52/716.8. Also, do we know why we see things in this order? The order differs from what is on the USPTO website.