Parsing Classes (sequence and values) #13

Open
laironald opened this issue Jun 26, 2013 · 10 comments

Comments

@laironald

does the parsing of the classes strip away the "." and other punctuation? when i compare patent # 8087209 (ipg120103) with the USPTO equivalent, I see differences.

The parser returns [[u'52', u'7168'], [u'52', u'7161'], [u'52', u'463'], [u'52', u'464'], [u'52', u'2881']]

On the USPTO site, I see 52/716.8. Also, do we know why we see things in this order? The order differs from what is on the USPTO website.

@gtfierro
Member

Is this related to #8?

@laironald
Author

not related


@gtfierro
Member

gtfierro commented Jul 1, 2013

Looking in the XML for ipg120103, I can see that the main-classification tag that indicates the US class is the following

<main-classification> 527168</main-classification> 

According to the current USPTO XML schema 4.2, the first 3 characters are the class and the remaining characters are the subclass. Counting the leading space, this gives us class = 52 and subclass = 7168. Again according to schema 4.2, the first 3 characters of the subclass sit to the left of an assumed decimal point, giving us subclass = 716.8, so it's definitely possible to parse out 52/716.8. I believe the other classes come from the US classifications of the cited patents.

Should we extract the classes in this way?
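
A minimal sketch of that extraction, assuming Python's standard `xml.etree.ElementTree` and only the single tag quoted above. The variable names are illustrative and this is not the parser's actual code.

```python
import xml.etree.ElementTree as ET

# The tag as quoted from ipg120103 (note the leading space).
fragment = "<main-classification> 527168</main-classification>"
raw = ET.fromstring(fragment).text  # " 527168", whitespace is preserved

us_class = raw[:3].strip()    # first 3 characters -> "52"
subclass = raw[3:6].strip()   # next 3 characters  -> "716"
decimal = raw[6:].strip()     # whatever is left   -> "8"

label = us_class + "/" + subclass
if decimal:
    label += "." + decimal
print(label)  # 52/716.8
```

The slicing only works on the raw text. If the string is stripped first, the class is no longer in the first three positions.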

@laironald
Author

gotcha. so if we see more than 3 digits in the subclass, we can add a period after the third. what happens in other cases?

@gtfierro
Member

gtfierro commented Jul 1, 2013

From the documentation

Table 6 - U.S. Patent Classifications
Class – A 3-position alphanumeric field right justified with leading spaces.
Design Patents – The first position will contain a “D”. Positions 2 and 3, right justified,
with a leading space when required for a single digit class.
Plant Patents – Positions 1-3 will contain a “PLT”
All Other Patents – Three alphanumeric positions, right justified,
with leading spaces
Sub-Class – Three alphanumeric positions, right justified with leading spaces, and, if present, one to three positions to the right of the decimal point (assumed decimal in the Red Book XML), left justified.

Note: An unstructured US classification would identify a sub-class
as a range with the sub-class range being separated by a hyphen “-”.
A digest entry as a sub-class would appear as follows:
Three positions containing “DIG”, followed by one to three alphanumeric positions, left justified.
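
One way to read those rules in code, as a rough sketch only. The function name, the kind labels, and every sample string except `" 527168"` are made up for illustration, and it assumes the class always occupies the first three positions.

```python
def describe(raw):
    """Return (class, sub-class, patent kind, sub-class kind).

    Assumes positions 1-3 hold the class and the rest holds the
    sub-class, per the Table 6 excerpt. Labels are hypothetical.
    """
    us_class, rest = raw[:3], raw[3:]

    if us_class == "PLT":
        kind = "plant"
    elif us_class.startswith("D"):
        kind = "design"        # "D" in the first position
    else:
        kind = "other"

    if rest.startswith("DIG"):
        sub_kind = "digest"    # "DIG" plus one to three positions
    elif "-" in rest:
        sub_kind = "range"     # unstructured sub-class range
    else:
        sub_kind = "plain"

    # Drop padding only after the fixed-width positions have been read.
    return us_class.replace(" ", ""), rest.strip(), kind, sub_kind


print(describe(" 527168"))     # ('52', '7168', 'other', 'plain')
print(describe("D 2123"))      # hypothetical design string
print(describe(" 52DIG1"))     # hypothetical digest string
print(describe(" 52100-105"))  # hypothetical range string
```

This only labels the string. It does not place the assumed decimal point.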

@laironald
Author

right, this stuff is so confusing. it's like creating structure within a
small field because the peeps at that USPTO team didn't want to think about
creating new tags. i think we can definitely add value by applying those
rules, as most people wouldn't bother with this... what do you think? (i
know it's painful)


@doolin
Member

doolin commented Jul 1, 2013

BNF -> PEG (if possible) -> Test drive.

http://fdik.org/pyPEG/

@gtfierro
Member

gtfierro commented Jul 2, 2013

I don't think it's really that complicated; we just have to decide how we want to transform the strings.

The basic form is <class>/<sub-class>.<more-sub-class>. This is simple enough for Design Patents. For Plant Patents, the first 3 characters are PLT, which seems to function as a class (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/def/plt.htm).

I think if we break up the class strings as:

  • class: string[:3]
  • subclass: string[3:6]
  • moresubclass: string[6:]

and don't strip the spaces, we should be fine
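
A short sketch of why the padding matters, using the three slices above. The helper names and both sample strings are illustrative assumptions, not values taken from the XML.

```python
def split_fixed_width(raw):
    """Slice exactly as proposed: class, subclass, moresubclass."""
    return raw[:3], raw[3:6], raw[6:]


def label(raw):
    # Padding is dropped only after slicing, for display purposes.
    cls, sub, more = (part.strip() for part in split_fixed_width(raw))
    return f"{cls}/{sub}.{more}" if more else f"{cls}/{sub}"


a = " 52463"    # subclass "463", nothing after the decimal
b = " 52 463"   # subclass " 46", "3" after the decimal

# Stripping before slicing loses the padding that tells the two apart.
assert a[3:].strip() == b[3:].strip() == "463"

print(label(a))  # 52/463
print(label(b))  # 52/46.3
```

Both strings collapse to the same stripped subclass, so the decimal position can only be recovered from the unstripped text.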

@laironald
Author

hey gabe. what does this data look like in DVN? we might want to match
that format as far as we can, so it's compatible.


@gtfierro
Member

gtfierro commented Jul 2, 2013

From what I can see, all the rows in /data/patentdata/DVNFIXED/class.sqlite3 look like

Patent | Prim | Class | Subclass
03930270 | 1 | 360 | 130.24 

The current code doesn't handle the subclass decimals, so once we handle that we should be fine in terms of backwards compatibility.
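
A hedged sketch of producing a row in that shape. The function name `dvn_row` is hypothetical, the raw string `"36013024"` is reconstructed from the row above rather than read from any XML, and both the zero padding to 8 digits and the meaning of `Prim` (1 for the main classification) are assumptions based on that single sample row.

```python
def dvn_row(patent_number, raw, primary):
    """Build a (Patent, Prim, Class, Subclass) tuple from a raw string."""
    us_class = raw[:3].strip()
    subclass = raw[3:6].strip()
    decimal = raw[6:].strip()
    if decimal:
        subclass = subclass + "." + decimal
    # Assumption: patent numbers are left-padded with zeros to 8 digits.
    return (patent_number.zfill(8), 1 if primary else 0, us_class, subclass)


row = dvn_row("3930270", "36013024", primary=True)
print(" | ".join(str(field) for field in row))  # 03930270 | 1 | 360 | 130.24
```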
