Feature Request: read bson data directly from dbpath (without mongod running) #4

MadDataScience · 2012-05-07T23:52:26Z

It would be really cool if Hive-mongo could read directly from MongoDB files rather than having to go through a mongod process (this way I could run it directly against backups without having to start mongod on them). If this is too difficult/impossible, the next best thing would be to be able to run it against the bson files produced by mongodump (though at that point, I'm already halfway to exporting the data to another format anyway).

yc-huang · 2012-05-08T02:46:21Z

Great idea! Currently it's not supported since we have a different use case:
we use MongoDB to store some meta/user profile data, and we need to both
query and update it.

The mongodump file seems to be just a collection of BSON objects, so if
there is a delimiter between each row/BSON object, all that is needed is a
BSON SerDe (a custom split implementation might also be needed to enable
parallel processing). I'm not sure how difficult it would be to implement
this based on the Java driver's BSON code; it still needs further
investigation.
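
On the delimiter question: per the BSON specification, every document starts with a little-endian int32 holding its total size in bytes and ends with a `0x00` byte, so a reader can walk a file of concatenated documents without any separator. A minimal sketch in Python, assuming the dump file is a plain concatenation of BSON documents; the name `iter_bson_documents` is hypothetical and not part of Hive-mongo or any driver:

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_bson_documents(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each raw BSON document from a stream of concatenated documents.

    Hypothetical helper: it only finds document boundaries and does not
    decode the fields inside each document.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of input
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # The first four bytes are the total document size, including
        # the prefix itself and the trailing 0x00 terminator.
        (size,) = struct.unpack("<i", header)
        if size < 5:
            raise ValueError(f"invalid BSON document length: {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1:] != b"\x00":
            raise ValueError("truncated or malformed BSON document")
        yield header + body


if __name__ == "__main__":
    # Two hand-built documents: {} and {"a": 1} (int32 value).
    empty = b"\x05\x00\x00\x00\x00"
    one_field = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
    for raw in iter_bson_documents(io.BytesIO(empty + one_field)):
        print(len(raw), raw.hex())
```

Because boundaries can only be found by reading each length prefix in order, a reader cannot safely start at an arbitrary byte offset, which is likely why a custom split implementation would be needed for parallel processing.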

I think you could dump to a CSV file using mongoexport as a workaround. If
the CSV is huge, compression (snappy, lzo, bz2, gzip) might help.

MadDataScience · 2012-05-08T16:51:42Z

CSV is no good as we have shifting schemata and nested documents and all kinds of other madness that make CSV a mess. I imagine you're already aware of https://github.com/mongodb/mongo-hadoop but I thought I'd mention it just in case.

yc-huang · 2012-05-10T01:47:53Z

Yeah, they have a wonderful shard-aware input split implementation, and we'd
like to migrate Hive-mongo to use that...

yc-huang · 2012-06-25T02:09:53Z

Just got a message from a 10gen engineer that they have a Hive connector which currently supports static BSON files:
https://github.com/mongodb/mongo-hadoop/tree/master/hive
