xbrl / xml documents embedded in .txt submission #6

Open
jsfenfen opened this issue Jun 4, 2013 · 2 comments

@jsfenfen
Contributor

jsfenfen commented Jun 4, 2013

In Models.xbrl, xbrl_localpath() assumes the xbrl filename has an .xml extension. But in some cases the xml/xbrl documents appear to have been included in a larger text submission. For example, see here: http://www.sec.gov/Archives/edgar/data/320193/0001193125-12-023398.txt (there are several distinct xbrl files included there). It looks like the text file also includes a binary zip file within a block. All inside a .txt file. Which is, uh, odd.

I came across this while looking into handling 10-Q filings--perhaps this isn't an issue for 10-K's. What do you think is the best way to handle this? Should parsing xml from within the .txt file be part of the download() step? Or is there another file location that these should be pulled from?

@lukerosiak
Owner

The file listed in the SEC's index (the filename attribute on the Index
model) points to that single file, which includes the HTML version,
its supplements, the XBRL versions, and any images (in binary) all in one,
separated by XML-like tags. It's the URL at the Index.html_link() method:
http://www.sec.gov/Archives/edgar/data/38067/000119312512146114/0001193125-12-146114.txt

That string can be transformed into the human-readable index of all those
components at index_link():
http://www.sec.gov/Archives/edgar/data/38067/000119312512146114/0001193125-12-146114-index.htm
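For reference, the string transformation described above can be sketched roughly as follows. This is only an illustration based on the two example URLs in this comment: the function names are hypothetical (they are not the model's actual html_link() / index_link() code), and the assumption that the index's filename attribute looks like edgar/data/CIK/ACCESSION.txt is inferred from the URLs in this thread.

```python
SEC_ARCHIVES = "http://www.sec.gov/Archives/"  # base URL as it appears in this thread


def html_link_from_filename(filename):
    """Build the full-submission .txt URL from an index filename.

    Assumes (not confirmed here) that filename looks like
    'edgar/data/38067/0001193125-12-146114.txt'.
    """
    if not filename.endswith(".txt"):
        raise ValueError("expected a path ending in .txt: %r" % filename)
    directory, name = filename.rsplit("/", 1)
    accession = name[:-len(".txt")]
    # The folder name is the accession number with the dashes removed.
    folder = accession.replace("-", "")
    return SEC_ARCHIVES + directory + "/" + folder + "/" + name


def index_link_from_html_link(txt_url):
    """Swap the trailing '.txt' for '-index.htm' to get the readable index."""
    if not txt_url.endswith(".txt"):
        raise ValueError("expected a URL ending in .txt: %r" % txt_url)
    return txt_url[:-len(".txt")] + "-index.htm"
```

Chaining the two, index_link_from_html_link(html_link_from_filename(...)), goes from the index filename to the -index.htm page without fetching anything.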

But I don't know how, other than parsing that HTML, to automatically get
the URL of the main submission. So I figured I'd access the version of the
10-K filed for human consumption by downloading the bigass file and
extracting the HTML chunk.

But as you can see on that sample index page, there's the main html file,
then a bunch more HTML files, some of which appear to be interesting
tables, others which are more form-letter type things. The index says there
are 17 documents. And when I search for in the big .txt file, there
are 107 occurrences. So I'm not sure what's going on, and the .html() method
as currently written won't quite do what I had in mind, which was to capture
all of the narrative portion of the 10-K laid out for humans. It will only
capture some of it, which may or may not be the biggest portion.

In any case, all of that is to permit the possibility of text analysis of
narratives. In terms of parsing structured financial data, the xbrl_link()
should find the path to the zip file containing the XBRL--I have focused
exclusively on 10-Ks in all this, though I figured it would work for all
types that have XBRL associated with them... but maybe there is another
pattern that is used to build links to quarterlies.


@jsfenfen
Contributor Author

jsfenfen commented Jun 8, 2013

Ok, at some point I'll look more closely at the SEC's spec for this kind of submission (I think it's there). If I'm following, there's a unique sequence_number for each individual piece. My sense is the way to handle this is with a separate model entirely--call it index_document. At the point at which a filing is downloaded, I'd populate index_document. If it's a normal 10-K and the xbrl is found, then index_document is just the xbrl file; if it's a giant mishmash of assorted formats tagged as text, then some (all? only the xbrl?) are extracted, saved to the local file system, and entered into index_document. Not sure this is the best path, but I find it appealing because it gives the possibility of including other files for later analysis. Also, it may have some bearing on the 8-Ks referenced here: #3 -- but I'm not sure.
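
A rough sketch of what the extraction step could look like, to make the idea concrete. It assumes each piece of the .txt submission is wrapped in a `<DOCUMENT>` block with `<TYPE>`, `<SEQUENCE>` and `<FILENAME>` header lines followed by a `<TEXT>` body; that layout should be checked against the SEC's spec before relying on it. `EmbeddedDocument`, `split_submission` and `xml_documents` are made-up names for illustration, not existing code in this project, and binary pieces such as the zip are returned as-is (still encoded), not decoded.

```python
import re
from collections import namedtuple

# Hypothetical record for one piece of a submission (a stand-in for index_document).
EmbeddedDocument = namedtuple("EmbeddedDocument", "sequence doc_type filename text")

# Assumed layout: <DOCUMENT> header lines <TEXT> body </TEXT> </DOCUMENT>
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def header_value(header, tag):
    """Return the rest of the line after <TAG> in the header, or None."""
    match = re.search(r"<%s>([^\r\n<]*)" % re.escape(tag), header)
    return match.group(1).strip() if match else None


def split_submission(raw):
    """Split the raw .txt submission into a list of EmbeddedDocument records."""
    documents = []
    for doc_match in DOCUMENT_RE.finditer(raw):
        block = doc_match.group(1)
        # Only look for header tags before the body, so tags inside the
        # embedded file itself are not picked up by mistake.
        header = block.split("<TEXT>", 1)[0]
        text_match = TEXT_RE.search(block)
        documents.append(EmbeddedDocument(
            sequence=header_value(header, "SEQUENCE"),
            doc_type=header_value(header, "TYPE"),
            filename=header_value(header, "FILENAME"),
            text=text_match.group(1).strip() if text_match else "",
        ))
    return documents


def xml_documents(documents):
    """Keep only the pieces whose filename ends in .xml."""
    return [d for d in documents
            if d.filename and d.filename.lower().endswith(".xml")]
```

Each record could then be written to the local file system under its filename and stored in index_document, which leaves room to keep the non-xbrl pieces for later analysis as well.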
