-
Notifications
You must be signed in to change notification settings - Fork 288
Using ElasticSearch as Unit Index
This is the test script that populates ElasticSearch index with Pootle units.
Run it like that:
./populate_es_unit_index.pl --user=pootle --password=yOur_PaSSwOrd
(by default it connects to pootle
database on local MySQL instance).
#!/usr/bin/perl
=head1 NAME
populate_es_unit_index.pl - Populate ElasticSearch index with Pootle units
=head1 DESCRIPTION
B<populate_es_unit_index.pl> is an utility for bulk-updating ElasticSearch index with Pootle units for faster searching
=head1 SYNOPSIS
populate_es_unit_index.pl --host=localhost --user=username --password=password --port=port --database=database
=cut
use strict;
use DBI;
use DateTime;
use Getopt::Long;
use JSON -support_by_pp;
use LWP::UserAgent;
my $ELASTICSEARCH_HOST = 'http://localhost:9200';
my $ELASTICSEARCH_INDEX = 'pootle_test';
my $ELASTICSEARCH_TYPE = 'unit';
binmode(STDOUT, ":utf8");
$| = 1;
my ($host, $database, $username, $password, $port) = ('localhost', 'pootle', 'pootle', undef, 3306);
GetOptions(
"host|h:s" => \$host,
"user|u:s" => \$username,
"port|P:s" => \$port,
"database|D:s" => \$database,
"password=s" => \$password
);
my $dbh = DBI->connect("DBI:mysql:database=$database;host=$host;port=$port",
$username,
$password,
{
'mysql_enable_utf8' => 1,
'mysql_bind_type_guessing' => 1,
}
) or die "Can't connect to the mysql database.";
my $ua = LWP::UserAgent->new;
print "Deleting '$ELASTICSEARCH_INDEX' index in ElasticSearch...";
my $req = HTTP::Request->new('DELETE', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
my $res = $ua->request($req);
check_error($res) if $res->code != 404;
print "\n";
my $date_type = {
type => 'date',
format => 'yyyy-MM-dd HH:mm:ss',
};
my $index_mapping = {
mappings => {
$ELASTICSEARCH_TYPE => {
properties => {
pootle_path => {
type => 'string',
index => 'not_analyzed',
},
index => {
type => 'long',
},
state => {
type => 'short',
},
mtime => $date_type,
submitted_on => $date_type,
creation_time => $date_type,
reviewed_on => $date_type,
}
}
}
};
print "Creating '$ELASTICSEARCH_INDEX' index with field mappings in ElasticSearch...";
my $req = HTTP::Request->new('PUT', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
$req->header('Content-Type' => 'application/json');
$req->content(to_json($index_mapping, {utf8 => 1}));
my $res = $ua->request($req);
check_error($res);
print "\n";
my $bulk_url = "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX/$ELASTICSEARCH_TYPE/_bulk";
my $sql = "
SELECT u.id, s.pootle_path, u.index, u.state, u.mtime,
u.submitted_on, u.creation_time, u.reviewed_on
FROM pootle_store_unit u
JOIN pootle_store_store s ON u.store_id = s.id
";
my $sth = $dbh->prepare($sql) or die $dbh->errstr;
print "Executing SQL query...";
$sth->execute;
print " Done\n";
my ($start, $now);
$start = $now = DateTime->now;
my $i;
my @bulk_rows = ();
while (my $h = $sth->fetchrow_hashref) {
my $command = {
index => {
_id => $h->{id}
}
};
die "mtime is null" unless $h->{mtime};
my $data = {
pootle_path => $h->{pootle_path},
index => $h->{index},
state => $h->{state},
mtime => $h->{mtime},
};
$data->{submitted_on} = $h->{submitted_on} if $h->{submitted_on};
$data->{creation_time} = $h->{creation_time} if $h->{creation_time};
$data->{reviewed_on} = $h->{reviewed_on} if $h->{reviewed_on};
push @bulk_rows, to_json($command, {utf8 => 1});
push @bulk_rows, to_json($data, {utf8 => 1});
$i++;
if ($i % 10000 == 0) {
push_bulk_request();
}
}
push_bulk_request();
$sth->finish;
$dbh->disconnect;
$now = DateTime->now;
my $d = $now - $start;
print "[$now]\tAll done in ".join(':', map {$_ > 10 ? "$_" : "0$_"} ($d->hours(), $d->minutes(), $d->seconds()))."\n";
sub push_bulk_request {
$now = DateTime->now;
print "[$now]\t";
if (length(@bulk_rows) > 0) {
my $req = HTTP::Request->new('POST', $bulk_url);
$req->header('Content-Type' => 'application/json');
$req->content(join("\n", @bulk_rows));
my $res = $ua->request($req);
check_error($res);
}
@bulk_rows = ();
print "$i units were updated\n";
}
sub check_error {
my $res = shift;
if ($res->code >= 400) {
warn "ERROR: ", $res->code, "\n";
warn "Full response is: ", $res->decoded_content, "\n";
die;
}
}
I ran this on my local Pootle DB instance containing 2,409,581 units. Full indexing took 5 minutes 50 seconds. After that I ran some sample queries against ElasticSearch (I recommend using 'Sense' Chrome extension to run ES queries):
Average query execution time: ~25-30ms
POST /pootle_test/unit/_search
{
"from": 0,
"size": 20,
"query": {
"match_all": {}
},
"sort": ["pootle_path", "index"],
"_source": false
}
Average query execution time: ~60-70ms
POST /pootle_test/unit/_search
{
"from": 5000,
"size": 20,
"query": {
"match_all": {}
},
"sort": ["pootle_path", "index"],
"_source": false
}
Average query execution time: ~4-5ms
POST /pootle_test/unit/_search
{
"from": 0,
"size": 20,
"query": {
"prefix": {
"pootle_path": "/ru/"
}
},
"sort": ["pootle_path", "index"],
"_source": false
}
Getting 20 units starting from offset 5000, ordered by pootle_path
and index
for pootle_path
starting with /ru/
Average query execution time: ~30-40ms
POST /pootle_test/unit/_search
{
"from": 5000,
"size": 20,
"query": {
"prefix": {
"pootle_path": "/ru/"
}
},
"sort": ["pootle_path", "index"],
"_source": false
}
Once one gets a sorted vector of 20 unit IDs, they would need to run an SQL query against Pootle DB to retrieve the actual unit data, something like this:
SELECT *
FROM pootle_store_unit u
JOIN pootle_store_store s ON u.store_id = s.id
WHERE u.id IN (
1351041, 3372829, 1351040, 1351039, 1351037,
1357104, 1357105, 1381339, 1389822, 1424225,
1447226, 1464442, 1463805, 1464440, 1463797,
1463801, 1463807, 1464441, 1463806, 1463795
)
In my tests, the average execution time was always reported in my SQL DB admin as "0.000s" (in other words, less than 0.5ms + rounding, which is negligibly small and should probably not be taken into account at all).