Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancestry with ActiveImport #433

Open
Brotakuu opened this issue Mar 19, 2019 · 6 comments
Open

Ancestry with ActiveImport #433

Brotakuu opened this issue Mar 19, 2019 · 6 comments

Comments

@Brotakuu
Copy link

I'm using active import to import comment trees. Is there any way to create the ancestry tree during the import (when the primary key id is not created yet)?

Currently, I'm importing all the comments, then looping through each comment to add the parent, and it is very slow even with indexes:

comments_to_import = []
import_data.each do |comment|
    comments_to_import << Comment.new(is_parent: comment.parent_uuid.nil?, uuid: comment.uuid, parent_uuid: comment.parent_uuid, body: comment.body)
end
Comment.import comments_to_import

# build comment tree
Comment.where(is_parent: false, ancestry: nil).all.each do |c|
    parent = Comment.find_by(uuid: c.parent_uuid)
    c.update(parent: parent)
end
@kbrock
Copy link
Collaborator

kbrock commented Mar 19, 2019

@Brotakuu are you doing a one time import, or is this a common process?
Is this postgres or mysql?

There is a built in method called: Model.build_ancestry_from_parent_ids!. but this does go through ruby. So, if you have a bunch of models, we may need to find a sql. approach

It is tricky to populate a bunch of ancestry parents. Also, it sometimes gets botched if you don't populate the parents, then their children, then their children.

I wonder if it is possible to make a sql statement that runs a few times that will build up the ancestry values 100% in the db.

This is back of the napkin type coding:

-- update root nodes
update comments
set ancestry = parent.id
from comments as parent ON parent.id = c.parent_id
where c.ancestry is null -- where I haven't been processed yet
and parent.parent_id is null -- my parent is a root

then keep going until nothing gets updated (update_all returns a 0)

-- update each level of nodes
update comments
set ancestry = concat(parent.ancestry,'/',parent.id)
from comments as parent ON parent.id = c.parent_id
where c.ancestry is null -- where I haven't been processed yet
and parent.ancestry is not null -- but my parent has been processed

If this looks like it is working, maybe we can get it into the product. Since updating the hierarchy is a pretty common task

@Brotakuu
Copy link
Author

Thanks @kbrock. I'm using postgresql and this is a weekly process of importing 300k comments which only have parent_uuid column referencing the direct parent comment. So i'm using ancestry to build the entire comment tree.

During the bulk import process, there's no way to set the parent_id since the parent comment is created at the same time as the descendants. So Model.build_ancestry_from_parent_ids! wouldn't work, unless I could run a similar method using the parent_uuid column.

So you're suggesting running a recursive sql query to build the tree from top down? Do you think this would be faster than what I'm currently running:

Comment.where(is_parent: false, ancestry: nil).all.each do |c|
    parent = Comment.find_by(uuid: c.parent_uuid)
    c.update(parent: parent)
end

@kbrock
Copy link
Collaborator

kbrock commented Mar 20, 2019

@Brotakuu think your code is pretty similar to build_ancestry_from_parent_ids! with the record passed in.

It sounds like you are building the whole tree of comments. If the depth is more than 1 deep (just parents and children), then it is important to update the records in a particular order. #429 explains how just setting the parent without paying attention to the order is troublesome.

You need to make sure the parent's ancestry is setup properly before the child's ancestry is set. Otherwise the child ancestry trees will be partial and not setup correctly.

@kbrock
Copy link
Collaborator

kbrock commented Jul 4, 2020

the performance differences will be <1 second vs an N+1. for us it was a few minutes vs under a second

if you look at the ruby, comment.ancestry = comment.parent.ancestry + "/" + comment.parent.id assuming that parent has a parent. If parent does not have a parent, then it is comment.ancestry = comment.parent.id

so if you know either of those cases, you can simplify the resulting sql

so right now we have

comment.ancestry =
  case comment.parent.ancestry is null then concat(comment.parent.ancestry, "/",comment.parent.id)
  else comment.parent_id
  end

given that parent is a self join and parent either has a nil parent id and nil ancestry or non nil parent id and non nil ancestry

in the other cases, the parent hasn't been updated yet and we need to approach in a separate pass.

you can probably get some ideas from
https://github.com/ManageIQ/manageiq-schema/pull/465/files

shows us starting at the bottom and going deeper.

@d-m-u do you have bandwidth to fix this? I'm thinking we have too many loose ends right now to approach. (and this is from 2019. so while lots of people need this, including us, it may unfortunately take a backseat)

@d-m-u
Copy link
Contributor

d-m-u commented Jul 7, 2020

@kbrock I think it's worth a shot.

@donnguyen
Copy link

@kbrock @d-m-u do we have any updates on this? Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants