Use random instead of sequential bookids. #103
A niche case for some extensions is to use a Bookworm word count index as an input to a batched machine learning process. I sometimes want to train a neural network on word counts, for example.

A `SELECT * FROM master_bookcounts` query will provide this data, sorted by bookid. Most large text corpora have some pre-specified order: Chronicling America, for instance, is arranged by newspaper, and then by year within each newspaper. That ordering makes training less effective, since each batch should be as representative of the whole corpus as possible. A gradient descent algorithm might lurch quite far off in the wrong direction, or even (gasp) find and settle into a local minimum.

If bookids are randomly assigned, rather than assigned sequentially, there's a workaround for this: the primary table keys will allow `SELECT * FROM master_bookcounts ORDER BY bookid` to be delivered nearly as fast as a table scan, but it will now return rows in a random order. Since bookids are arbitrary and internal, I don't see a downside here. It's possible, though, that index construction might take longer if the tables aren't sorted to begin with, so that's something to watch out for.
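A minimal sketch of the batched-training use case, assuming a DB-API-style MySQL connection (the driver choice, function name, and batch size are illustrative, not part of the issue):

```python
import MySQLdb  # assumption: any DB-API driver works the same way

def word_count_batches(conn, batch_size=1024):
    """Stream mini-batches of word counts in effectively shuffled order."""
    cur = conn.cursor()
    # With randomly assigned bookids, ordering by the primary key yields
    # rows in shuffled document order at close to table-scan speed.
    cur.execute("SELECT * FROM master_bookcounts ORDER BY bookid")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows
```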
This might create problems when bookids are added down the road. Therefore, the random ids should be drawn from the full range `1,16777215` (16777215, or 2^24 - 1, being the maximum bookid supported under the standard scheme), not just from the range `1,catalog_length`.
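A minimal sketch of full-range assignment with collision checking, as suggested above (the function and variable names are mine, not the project's):

```python
import random

MAX_BOOKID = 16777215  # 2**24 - 1, the ceiling named above

def draw_bookid(used, rng=random):
    """Draw an unused random bookid from the full id range.

    Collisions with already-assigned ids are simply retried; with a few
    million documents in a ~16.7M-id space, retries stay rare.
    """
    while True:
        candidate = rng.randint(1, MAX_BOOKID)  # inclusive on both ends
        if candidate not in used:
            used.add(candidate)
            return candidate
```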
Comments

Seems unnecessarily complex for a very specific use. I've been training …

On Fri, Jun 24, 2016, 9:20 AM, Benjamin Schmidt wrote:

> The numbers here are small enough that there doesn't need to be hashing for assignment. Essentially, right now, the books are assigned an id by popping it off the list. It should be possible to run that shuffle with a seed, but I'm not sure what the benefits of that would be. If using a four-byte integer (as they do in Chicago, I've heard), the assignment would have to check for collisions against the already-assigned documents. Which is OK, because ID assignment has never been a thread-safe (or thread-demanding) process.
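A minimal sketch of the scheme described in the quote, with the seed made explicit (the names are illustrative, not the project's actual code):

```python
import random

def assign_bookids(documents, seed=0):
    """Assign the ids 1..N in a shuffled, reproducible order.

    Sketch only: 'documents' stands in for whatever identifiers the
    catalog provides; ids are popped off a pre-shuffled list.
    """
    ids = list(range(1, len(documents) + 1))
    random.Random(seed).shuffle(ids)  # seeded shuffle, so runs repeat exactly
    return {doc: ids.pop() for doc in documents}
```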