data-import is a data-migration framework. The goal of the project is to provide a simple api to migrate data from a legacy schema into a new one. It's based on jeremyevans/sequel.
gem 'data-import'
you can put your migration configuration in any file you like. We suggest something like mapping.rb
.
You can find the various ways to connect described in the sequel docs.
source 'sqlite://legacy_blog.db'
target :adapter => :postgres, :host => 'localhost', :user => 'user', :password => 'password', :database => 'blog'
import 'Animals' do
from 'tblAnimal', :primary_key => 'sAnimalID'
to 'animals'
mapping 'sAnimalID' => 'id'
mapping 'strAnimalTitleText' => 'name'
mapping 'sAnimalAge' => 'age'
mapping 'convert threat to danger rating' do
rating = ['none', 'medium', 'big'].index(row[:strThreat]) + 1
{:danger_rating => rating}
end
end
to run the import just execute:
mapping_path = Rails.root + 'mapping.rb'
DataImport.run_config! mapping_path
if you execute the import frequently you can create a Rake-Task:
desc "Imports the date from the source database"
task :import do
mapping_path = Rails.root + 'mapping.rb'
options = {}
options[:only] = ENV['RUN_ONLY'].split(',') if ENV['RUN_ONLY'].present?
DataImport.run_config! mapping_path, options
end
You can set the import to fail fast if your definitions contain mapping errors like referencing non-existent columns:
options = { strict_mode: true }
DataImport.run_config! mapping_path, options
data-import provides a clean dsl to define your mappings from the legacy schema to the new one.
data-import allows you to definie a global filter. This filter can be used to make global transformations like encoding fixes. You can define a filter, which downcases every string like so:
before_filter do |row|
row.each do |k, v|
row[k] = v.downcase if v.respond_to?(:downcase)
end
end
You've already seen a very basic example of the dsl in the Installation-Section. This part shows off the features of the mapping-DSL.
every mapping starts with a call to import
followed by the name of the mapping. You can name mappings however you like. The block passed to import contains the mapping itself. You can supply the source-table with from
and the target-table with to
. Make sure that you set the primary-key on the source-table otherwhise pagination is not working properly and the migration will fill up your RAM.
import 'Users' do
from 'tblUser', :primary_key => 'sUserID'
to 'users'
In simple cases you would read from one table and write to another.
import 'Items' do
from 'tblItem'
to 'items'
end
This is not always sufficient. This gem allows you to specify a custom data source using the sequel syntax or plain SQL.
import 'Items' do
from 'items' do |sequel|
sequel[:tblItems].join(:tblOrderItems, :sItemID => :sID)
end
to 'items'
end
or
import 'Items' do
from 'items' do |sequel|
sequel[<<-SQL
SELECT *
FROM tblItems
INNER JOIN tblOrderItems ON tblOrderItems.sItemID = tblItems.sID
SQL
]
sequel[:tblItems].join(:tblOrderItems, :sItemID => :sID)
end
to 'items'
end
By default a new record will be inserted for record read from the source. This behaviour can be changed. For example you may want to update existing records.
import 'Article Authors' do
from 'tblArticleAbout', :primary_key => 'sID'
to 'articles', :mode => :update
mapping 'lArticleId' => 'id'
mapping 'strWho' => 'author'
end
With :mode => :update
you tell data-import to update a record
instead of inserting a new one. You have to specify a mapping for the
primary key of the target table. If there is no value for the primary
key then nothing will be updated.
There is also a unique writer to filter double records.
import 'Cities' do
from 'tblCities', :primary_key => 'sID'
to 'cities', :mode => [:unique, :columns => [:name, :zip]]
mapping 'strName' => 'name'
mapping 'sZip' => 'zip'
mapping 'strShort' => 'short_name'
end
Passing the option :mode => [:unique, :columns => [:name, :zip]]
makes data-import use a unique writer. :columns
must be an array of
column which will be used to identify a double record. Before a new
record will be inserted, data-import makes a select on the target
table with the defined columns. So depending on the size of your
table, you may consider adding an (unique) index on those columns to
speed up the import.
You can create simple name-mappings with a call to mapping
:
mapping 'sUserID' => 'id'
mapping 'strEmail' => 'email'
mapping 'strUsername' => 'username'
If you need to process a column you can add a block. You have access to the entire record read from the source database. The return value of the block should be a hash or nil. Nil means no mapping at all and in case of a hash you have to use the column-names of the target-table as keys.
mapping 'convert threat to danger rating' do
rating = ['none', 'medium', 'big'].index(row[:strThreat]) + 1
{:danger_rating => rating}
end
If you have static data that needs to be inserted you can used to following feature:
import 'managers' do
from 'tblManagers'
to 'managers'
seed :earns_much_money => true, :has_emplyees => true
end
Use after row blocks to specify some logic that will be executed after a row has been inserted.
import 'Sales Leads' do
from 'SL_NewLeads', :primary_key => 'slNewLeadsID'
to 'sales_leads'
mapping 'slNewLeadsID' => :id
after_row do
target_database[:contacts].insert(:firstname => row[:slName1],
:lastname => row[:slName2])
end
end
Rows can be validated before insertion.
import 'People' do
from 'Person'
to 'females'
mapping 'Name' => :name
mapping 'Gender' => :gender
validate_row do
if mapped_row[:gender] == 'f'
true
else
logger.info "Row #{row} skipped since the gender is male"
false
end
end
end
Inside the validation block you have access to the row read from the
data source (row
) and to the row with all mappings applied
(mapped_row
). If the result of the validation block is evaluated to
true, then the row will be inserted into the target table. If the
result is false, insertion will be skipped.
If you have a more complex mapping than just reading from one source and writing each record to another, you can define script blocks. Inside a script block you can write ruby code that does your data conversion. The whole block runs in a transaction to ensure consistency.
script 'my compex converion' do
body do
log_texts = source_database.db[:Logger].map {|record| record[:strText]}
target_database.db[:log].insert(log_texts.join(','))
end
end
source_database.db
and target_database.db
are sequel database objects. Look at the
sequel docs for more information.
You can specify dependencies between definitions. Dependencies are always run before a given definition will be executed. Adding all necessary dependencies also allows you to run a set of definitions instead of everything.
import 'Roles' do
from 'tblRole', :primary_key => 'sRoleID'
to 'roles'
end
import 'SubscriptionPlans' do
from 'tblSubcriptionCat', :primary_key => 'sSubscriptionCatID'
to 'subscription_plans'
end
import 'Users' do
from 'tblUser', :primary_key => 'sUserID'
to 'users'
dependencies 'SubscriptionPlans'
end
import 'Permissions' do
from 'tblUserRoles'
to 'permissions'
dependencies 'Users', 'Roles'
end
you can now run parts of your mappings using the :only option:
DataImport.run_config! 'mappings.rb', :only => ['Users'] # => imports SubscriptionPlans then Users
DataImport.run_config! 'mappings.rb', :only => ['Roles'] # => imports Roles only
DataImport.run_config! 'mappings.rb', :only => ['Permissions'] # => imports Roles, SubscriptionPlans, Users and then Permissions
If you have tables referenced on other fields than their primary-key you need to perform lookups while migrating the data. data-import provides a feature called lookup-tables to basically create an index on any given field to the migrated primary-key.
The following example shows a table People
which is linked to the table Organizations
. Sadly the legacy-schema used the field :code from the Organizations
-table as reference.
import 'Organizations' do
# define a lookup-table on the :sOrgId attribute named :sOrgId
lookup_for :sOrgId
# define a lookup-table on the :strCode attribute named :code
lookup_for :code, :column => 'strCode'
end
import 'People' do
dependencies 'Organizations'
# you can then use the previously defined lookup-table on :code to get the primary-key
reference 'Organizations', 'OrganizationCode' => :org_id, :lookup => :code
# or you can do the same thing manually (this also works in script blocks)
mapping 'organization code' do
{:org_id => definition('Organizations').identify_by(:code, row['strCode'])}
end
end
If you don't specify the option :lookup
then data-import uses the
lookup table called :id
.
In every block, be this mapping, after_row, script body, validation, etc., you have access to a logger, which can be accessed as follows:
import 'Animals' do
# source and target config
# mapping config
validate_row do
logger.info "animal name has been mapped from #{row[:strName]} to #{mapped_row[:name]}"
end
end
The logger supports the standard log levels debug, info, warn, error and fatal.
This gem supports two different kinds of logging. The full logger prints every bit information. By default this will be printed to a file called import.log in the project root. The important logger only prints messages of levels warn, error or fatal, which will be displayed on STDOUT be default. The file import.log will therefore hold every message you log in the data migration process. On STDOUT will only see severe messages beside the progress bar. The reason for this distinction is that you don't want STDOUT to be flooded by debug messages.
Full and important logger can be configured as follows:
DataImport.full_logger = Logger.new(STDOUT)
DataImport.important_logger = Logger.new(STDERR)
You can apply any object that provides the methods debug
, info
,
warn
, error
and fatal
.
you can learn a lot from the acceptance specs.
Just send me a message and I'll try to get to you as soon as possible.
Please submit a new issue.
- Fork data-import
- Create a topic branch -
git checkout -b my_branch
- Make your changes and update the History.txt file
- Push to your branch -
git push origin my_branch
- Send me a pull-request for your topic branch
- That's it!