Feature Request: read bson data directly from dbpath (without mongod running) #4

Open
MadDataScience opened this issue May 7, 2012 · 4 comments

Comments

@MadDataScience

It would be really cool if Hive-mongo could read directly from MongoDB files rather than having to go through a mongod process (this way I could run it directly against backups without having to start mongod on them). If this is too difficult or impossible, the next best thing would be to run it against the BSON files produced by mongodump (though at that point, I'm already halfway to exporting the data to another format anyway).

@yc-huang
Owner

yc-huang commented May 8, 2012

Great idea! It's not currently supported because our use case is different:
we use MongoDB to store some meta/user-profile data, and we need to both
query and update it.

The mongodump file seems to be just a collection of BSON objects, so if there
is a delimiter for each row/BSON object, all that is needed is a BSON
SerDe (and possibly a custom split implementation to enable parallel
processing). I'm not sure how difficult it would be to implement this on top
of the Java driver's BSON code; it still needs further investigation.
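
On the delimiter question: per the BSON specification, every document begins with a little-endian int32 giving its total size in bytes, so a mongodump `.bson` file (documents written back to back) can be walked without any separator. Below is a minimal Python sketch using only the standard library. The names `iter_raw_documents` and `plan_splits` are hypothetical, chosen for illustration; they are not part of Hive-mongo or the Java driver. Decoding the raw bytes into fields is left to a BSON library, such as the `bson` package that ships with PyMongo.

```python
import io
import struct
from typing import BinaryIO, Iterator, List, Tuple

# Smallest legal BSON document: int32 length (4 bytes) + trailing NUL byte.
MIN_DOC_SIZE = 5


def iter_raw_documents(stream: BinaryIO) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, raw_bytes) for each BSON document in a dump stream."""
    offset = 0
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError(f"truncated length prefix at offset {offset}")
        # The length prefix counts the whole document, itself included.
        (size,) = struct.unpack("<i", header)
        if size < MIN_DOC_SIZE:
            raise ValueError(f"invalid document size {size} at offset {offset}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError(f"truncated or corrupt document at offset {offset}")
        yield offset, header + body
        offset += size


def plan_splits(stream: BinaryIO, target_bytes: int) -> List[Tuple[int, int]]:
    """Group consecutive documents into (start, length) byte ranges.

    Each range is closed once it reaches target_bytes, so a range never
    cuts a document in half. This needs one sequential pass over the file.
    """
    splits: List[Tuple[int, int]] = []
    start, length = 0, 0
    for offset, raw in iter_raw_documents(stream):
        if length == 0:
            start = offset
        length += len(raw)
        if length >= target_bytes:
            splits.append((start, length))
            length = 0
    if length:
        splits.append((start, length))
    return splits
```

Calling `plan_splits(open(path, "rb"), target_bytes)` on a dump file would give byte ranges that separate workers could read independently. A real Hive input format would still have to map these ranges onto its own split and record-reader classes, which this sketch does not attempt.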

As a workaround, I think you could dump to a CSV file using mongoexport. If
the CSV is huge, compression (snappy, lzo, bz2, gzip) might help.

On Tue, May 8, 2012 at 7:52 AM, Alessandro D. Gagliardi wrote:

It would be really cool if Hive-mongo could read directly from MongoDB
files rather than having to go through a mongod process (this way I could
run it directly against backups without having to start mongod on them). If
this is too difficult/impossible, the next best thing would be to be able
to run it against the bson files produced by mongodump (though at that
point, I'm already halfway to exporting the data to another format anyway).


@MadDataScience
Author

CSV is no good as we have shifting schemata and nested documents and all kinds of other madness that make CSV a mess. I imagine you're already aware of https://github.com/mongodb/mongo-hadoop but I thought I'd mention it just in case.
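
To illustrate the CSV problem, here is a small Python sketch with made-up documents (the field names and the `flatten` and `to_csv` helpers are hypothetical, not anything from Hive-mongo or mongoexport). Nested documents have to be flattened into dotted column names, and because the schema shifts, the header is only known after a full pass over the data; arrays still have no natural CSV representation.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys; lists are left as-is."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def to_csv(docs: Iterable[Dict[str, Any]]) -> str:
    """Write documents as CSV, using the union of all flattened keys as header."""
    rows: List[Dict[str, Any]] = [flatten(d) for d in docs]
    # The column set depends on every document, so it cannot be streamed.
    columns = sorted({column for row in rows for column in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Two documents from the same (hypothetical) collection with different shapes.
docs = [
    {"_id": 1, "name": "ann"},
    {"_id": 2, "profile": {"city": "Oslo", "langs": ["no", "en"]}},
]
print(to_csv(docs))
```

Each new field seen anywhere in the collection adds a column that is empty for most rows, and the list under `profile.langs` ends up as an opaque string in a single cell.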

@yc-huang
Owner

Yeah, they have a wonderful shard-aware input split implementation, and we'd
like to migrate Hive-mongo to use that...

On Wednesday, May 9, 2012, Alessandro D. Gagliardi wrote:

CSV is no good as we have shifting schemata and nested documents and all
kinds of other madness that make CSV a mess. I imagine you're already aware
of https://github.com/mongodb/mongo-hadoop but I thought I'd mention it
just in case.


@yc-huang
Owner

Just got a message from a 10gen engineer: they have a Hive connector that currently supports static BSON files:
https://github.com/mongodb/mongo-hadoop/tree/master/hive
