Using ElasticSearch as Unit Index

This is the test script that populates ElasticSearch index with Pootle units.

Run it like that:

./populate_es_unit_index.pl --user=pootle --password=yOur_PaSSwOrd

(by default it connects to pootle database on local MySQL instance).

`populate_es_unit_index.pl`

#!/usr/bin/perl
=head1 NAME

populate_es_unit_index.pl - Populate ElasticSearch index with Pootle units

=head1 DESCRIPTION

B<populate_es_unit_index.pl> is an utility for bulk-updating ElasticSearch index with Pootle units for faster searching

=head1 SYNOPSIS

populate_es_unit_index.pl --host=localhost --user=username --password=password --port=port --database=database

=cut

use strict;

use DBI;
use DateTime;
use Getopt::Long;
use JSON -support_by_pp;
use LWP::UserAgent;

my $ELASTICSEARCH_HOST = 'http://localhost:9200';
my $ELASTICSEARCH_INDEX = 'pootle_test';
my $ELASTICSEARCH_TYPE = 'unit';

binmode(STDOUT, ":utf8");
$| = 1;

my ($host, $database, $username, $password, $port) = ('localhost', 'pootle', 'pootle', undef, 3306);

GetOptions(
	"host|h:s" => \$host,
	"user|u:s" => \$username,
	"port|P:s" => \$port,
	"database|D:s" => \$database,
	"password=s" => \$password
);

my $dbh = DBI->connect("DBI:mysql:database=$database;host=$host;port=$port",
	$username,
	$password,
	{
        'mysql_enable_utf8' => 1,
        'mysql_bind_type_guessing' => 1,
    }
) or die "Can't connect to the mysql database.";

my $ua = LWP::UserAgent->new;

print "Deleting '$ELASTICSEARCH_INDEX' index in ElasticSearch...";
my $req = HTTP::Request->new('DELETE', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
my $res = $ua->request($req);
check_error($res) if $res->code != 404;
print "\n";

my $date_type = {
    type => 'date',
    format => 'yyyy-MM-dd HH:mm:ss',
};

my $index_mapping = {
    mappings => {
        $ELASTICSEARCH_TYPE => {
            properties => {
                pootle_path => {
                    type => 'string',
                    index => 'not_analyzed',
                },
                index => {
                    type => 'long',
                },
                state => {
                    type => 'short',
                },
                mtime => $date_type,
                submitted_on => $date_type,
                creation_time => $date_type,
                reviewed_on => $date_type,
            }
        }
    }
};

print "Creating '$ELASTICSEARCH_INDEX' index with field mappings in ElasticSearch...";
my $req = HTTP::Request->new('PUT', "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX");
$req->header('Content-Type' => 'application/json');
$req->content(to_json($index_mapping, {utf8 => 1}));
my $res = $ua->request($req);
check_error($res);
print "\n";

my $bulk_url = "$ELASTICSEARCH_HOST/$ELASTICSEARCH_INDEX/$ELASTICSEARCH_TYPE/_bulk";

my $sql = "
    SELECT u.id, s.pootle_path, u.index, u.state, u.mtime,
    u.submitted_on, u.creation_time, u.reviewed_on
    FROM pootle_store_unit u
    JOIN pootle_store_store s ON u.store_id = s.id
";
my $sth = $dbh->prepare($sql) or die $dbh->errstr;

print "Executing SQL query...";
$sth->execute;
print " Done\n";

my ($start, $now);
$start = $now = DateTime->now;

my $i;
my @bulk_rows = ();
while (my $h = $sth->fetchrow_hashref) {
    my $command = {
        index => {
            _id => $h->{id}
        }
    };

    die "mtime is null" unless $h->{mtime};

    my $data = {
        pootle_path => $h->{pootle_path},
        index => $h->{index},
        state => $h->{state},
        mtime => $h->{mtime},
    };
    $data->{submitted_on} = $h->{submitted_on} if $h->{submitted_on};
    $data->{creation_time} = $h->{creation_time} if $h->{creation_time};
    $data->{reviewed_on} = $h->{reviewed_on} if $h->{reviewed_on};

    push @bulk_rows, to_json($command, {utf8 => 1});
    push @bulk_rows, to_json($data, {utf8 => 1});

    $i++;
    if ($i % 10000 == 0) {
        push_bulk_request();
    }
}
push_bulk_request();

$sth->finish;
$dbh->disconnect;

$now = DateTime->now;
my $d = $now - $start;

print "[$now]\tAll done in ".join(':', map {$_ > 10 ? "$_" : "0$_"} ($d->hours(), $d->minutes(), $d->seconds()))."\n";

sub push_bulk_request {
    $now = DateTime->now;
    print "[$now]\t";
    if (length(@bulk_rows) > 0) {
        my $req = HTTP::Request->new('POST', $bulk_url);
        $req->header('Content-Type' => 'application/json');
        $req->content(join("\n", @bulk_rows));
        my $res = $ua->request($req);
        check_error($res);
    }
    @bulk_rows = ();
    print "$i units were updated\n";
}

sub check_error {
    my $res = shift;
    if ($res->code >= 400) {
        warn "ERROR: ", $res->code, "\n";
        warn "Full response is: ", $res->decoded_content, "\n";
        die;
    }
}

I ran this on my local Pootle DB instance containing 2,409,581 units. Full indexing took 5 minutes 50 seconds. After that I ran some sample queries against ElasticSearch (I recommend using 'Sense' Chrome extension to run ES queries):

Getting first 20 units ordered by `pootle_path` and `index`

Average query execution time: ~25-30ms

POST /pootle_test/unit/_search

{
   "from": 0,
   "size": 20,
   "query": {
      "match_all": {}
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index`

Average query execution time: ~60-70ms

POST /pootle_test/unit/_search

{
   "from": 5000,
   "size": 20,
   "query": {
      "match_all": {}
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting first 20 units ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`

Average query execution time: ~4-5ms

POST /pootle_test/unit/_search

{
   "from": 0,
   "size": 20,
   "query": {
      "prefix": {
         "pootle_path": "/ru/"
      }
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`

Average query execution time: ~30-40ms

POST /pootle_test/unit/_search

{
   "from": 5000,
   "size": 20,
   "query": {
      "prefix": {
         "pootle_path": "/ru/"
      }
   },
   "sort": ["pootle_path", "index"],
   "_source": false
}

SQL query to retrieve actual data

Once one gets a sorted vector of 20 unit IDs, they would need to run an SQL query against Pootle DB to retrieve the actual unit data, something like this:

SELECT *
FROM pootle_store_unit u
JOIN pootle_store_store s ON u.store_id = s.id
WHERE u.id IN (
    1351041, 3372829, 1351040, 1351039, 1351037,
    1357104, 1357105, 1381339, 1389822, 1424225,
    1447226, 1464442, 1463805, 1464440, 1463797,
    1463801, 1463807, 1464441, 1463806, 1463795
)

In my tests, the average execution time was always reported in my SQL DB admin as "0.000s" (in other words, less than 0.5ms + rounding, which is negligibly small and should probably not be taken into account at all).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using ElasticSearch as Unit Index

`populate_es_unit_index.pl`

Getting first 20 units ordered by `pootle_path` and `index`

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index`

Getting first 20 units ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`

SQL query to retrieve actual data

Clone this wiki locally

Using ElasticSearch as Unit Index

populate_es_unit_index.pl

Getting first 20 units ordered by pootle_path and index

Getting 20 units starting from offset 5000, ordered by pootle_path and index

Getting first 20 units ordered by pootle_path and index for pootle_path starting with /ru/

Getting 20 units starting from offset 5000, ordered by pootle_path and index for pootle_path starting with /ru/

SQL query to retrieve actual data

Clone this wiki locally

`populate_es_unit_index.pl`

Getting first 20 units ordered by `pootle_path` and `index`

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index`

Getting first 20 units ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`

Getting 20 units starting from offset 5000, ordered by `pootle_path` and `index` for `pootle_path` starting with `/ru/`