tX Development Architecture

Richard Mahn edited this page Aug 30, 2016 · 37 revisions

Document Status: [Draft|Proposal|Accepted]

This document explains the layout of the translationConvertor (tX) conversion platform and how the components of the system should interact with one another.

If you just want to use the tX API, see tX API Example Usage.

Keep reading if you want to contribute to tX.

Goals

tX is intended to be a conversion tool for the content in the Door43 Platform. The goal is to support several different input formats, output formats, and resource types.

Development goals are:

  • Keep the system modular, in order to:
    • Encourage others to contribute and make it simple to do so
    • Contain development, testing, and deployment to each individual component
    • Constrain features, bug fixes, and security issues to a smaller codebase for each component
  • Continuous Deployment, which means
    • Automated testing is required
    • Continuous integration is required
    • Checks and balances on our process
  • RESTful API utilizing JSON

Infrastructure

Overview

All code for tX runs on AWS Lambda. The AWS API Gateway service routes URL requests to Lambda functions. Data and any required persistent metadata are stored in AWS S3 buckets. This makes tX a "serverless" API.

Developers use Apex (deploying Lambda functions), Travis CI (continuous integration), and Coveralls (test coverage reporting).

Permissions (mostly for accessing S3 buckets) are managed by the role assigned to each Lambda function.

Modules may be written in any language supported by AWS Lambda (including some that are available via "shimming"). As of July 2016, this list includes:

  • Java (v8)
  • Python (v2.7)
  • Node.js (v0.10 or v4.3)
  • Go lang (any version)

Modules MUST all present an API endpoint that the other components of the system can use. Modules MAY present API endpoints that the public can use.

Separating Production from Development

We want our code to neither know nor care whether it is running in the production or development environment. However, many values and storage locations differ between the two. Bucket names are one example: our two AWS accounts must use different names, since all bucket names on AWS must be unique.

So that the clients, tx-manager, and the conversion modules don't have to worry about this, everything that varies between environments is set up in the API Gateway Stage Variables. These are variables we set up in AWS for a particular API's URL. Along with the payload sent by the requesting client, these variables are also passed in the "event" argument of the Lambda handler function.

For example, such variables may be:

  • cdn_bucket = "test-cdn.door43.org" or "cdn.door43.org"
  • api_bucket = "test-api.door43.org" or "api.door43.org"
  • door43_bucket = "test-door43.org" or "door43.org"
  • gogs_user_token = "<a user token from test.door43.org:3000>" or "<a user token from git.door43.org>"
  • gogs_username = "<username of the above user_token>"
  • tx_manager_url = "<URL of the tx-manager API Gateway>"
  • env = "development" or "production" (just in case you want to still do something different based on environment in your code)
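
As a minimal sketch of how a handler might consume these values, the example below assumes a hypothetical event layout in which the stage variables arrive under a "vars" key and the client payload under a "data" key. The real layout depends on how the API Gateway mapping is configured, and the names handle, vars, and data are illustrative only.

```python
def handle(event, context):
    """Lambda entry point that never hard-codes an environment.

    ASSUMPTION: the event looks like
    {"vars": {<stage variables>}, "data": {<client payload>}}.
    The actual layout is decided by the API Gateway configuration.
    """
    stage_vars = event.get("vars", {})
    payload = event.get("data", {})

    # Bucket names come from the stage, so the same code runs unchanged
    # in the development and production accounts.
    cdn_bucket = stage_vars["cdn_bucket"]
    env = stage_vars.get("env", "development")

    return {"env": env, "output_bucket": cdn_bucket, "payload": payload}
```

Because the handler only reads what it is given, it can be exercised locally by passing a plain dictionary as the event.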

Development Environment

The development environment should use the WA AWS account. There are 3 test buckets that have been created that mirror the production buckets:

  • test-api.door43.org - for tx-manager to manage data for tX (only /tx namespace should be used) (public access disabled on this)
  • test-cdn.door43.org - for conversion modules to upload their output to (only /tx namespace should be used) (public access enabled on this)
  • test-door43.org - For Jekyll and /u generated files to upload to (public access enabled on this)

The develop branch for each repo should automatically deploy to this account and make use of the above buckets.

Production Environment

The production environment should use the Door43 AWS account. The production buckets are:

  • api.door43.org - for tx-manager to manage data for tX (only /tx namespace should be used) (public access disabled on this)
  • cdn.door43.org - for conversion modules to upload their output to (only /tx namespace should be used) (public access enabled on this)
  • door43.org - For Jekyll and /u generated files to upload to (public access enabled on this)

The master branch for each repo should automatically deploy to this account and make use of the above buckets.

Modules

Every part of tX is broken into components referred to as tX modules. Each tX module has one or more functions that it provides to the overall system. The list of tX modules is given here, with a full description in its respective heading below.

tX Management Module

The tX Management Module provides the following functions:

  • Maintains the registry for all modules in tX
  • Authorization for requests via the tx-auth module
    • Accepts user credentials via HTTP Basic Auth (over HTTPS)
    • Counts requests made by each token
    • Blocks access if requests per minute reaches a certain threshold
  • Handles the public API paths that modules register
  • Job queue management and rendered file presentation
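
The request counting and blocking described above could be sketched as a sliding one-minute window per token. This is an illustration, not the tx-manager implementation: the class name RequestCounter and its threshold parameter are hypothetical, and the counts live in memory, which does not persist across Lambda invocations, so a real version would need to keep them in shared storage.

```python
import time
from collections import defaultdict, deque


class RequestCounter:
    """Counts requests per token over a sliding 60-second window."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        # token -> timestamps (in seconds) of that token's recent requests
        self._hits = defaultdict(deque)

    def allow(self, token, now=None):
        """Record a request; return False once the threshold is reached."""
        now = time.time() if now is None else now
        hits = self._hits[token]
        # Forget requests that are at least one minute old.
        while hits and now - hits[0] >= 60:
            hits.popleft()
        if len(hits) >= self.max_per_minute:
            return False  # threshold reached: block this request
        hits.append(now)
        return True
```

Passing now explicitly keeps the behaviour deterministic in tests; in normal use it defaults to the current time.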

tX Authorization Module

The tX Authorization Module is an authorization module for the tX system. In reality, this is just the python-gogs-client. The tx-manager module uses it to authorize requests. The module handles the following:

  • Grants access to the API based on Gogs tokens

tX Conversion Modules

Each conversion module accepts a specific type of text format as its input and the module returns a specific type of output document. For example, there is a md2pdf module that converts Markdown text into a rendered PDF. The conversion modules also require that you specify the resource type, which affects the formatting of the output document.

Input Format Types

There are currently two accepted input format types:

  • Markdown - md
  • Unified Standard Format Markers - usfm

A few notes on input formatting:

  • Conversion modules do not do pre-processing of the text. The data supplied must be well formed.
  • Conversion modules expect a single file either:
    • A plaintext file of the appropriate format (md or usfm).
    • A zip file with multiple plaintext files of the appropriate format.

In the case of a zip file, the conversion module should process the files in alphabetical order. According to our obs file naming convention and the usfm standard, this process should yield the correct output in both cases.
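
A minimal sketch of the alphabetical processing rule, using only the Python standard library. The function name read_sources_in_order is hypothetical, and the sketch assumes the files are UTF-8 encoded, which this document does not state.

```python
import io
import zipfile


def read_sources_in_order(zip_bytes, extension):
    """Return (name, text) pairs for matching files, sorted by file name."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Keep only files of the requested format and sort them by name.
        names = sorted(
            name for name in archive.namelist()
            if name.lower().endswith("." + extension)
        )
        # ASSUMPTION: source files are UTF-8 text.
        return [(name, archive.read(name).decode("utf-8")) for name in names]
```

Sorting the names explicitly matters because the order of entries inside a zip archive is whatever order they were added in, not necessarily alphabetical.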

Output Format Types

For each type of input format, the following output formats are supported:

  • PDF - pdf
  • DOCX - docx
  • HTML - html

Resource Types

Each of these resource types affects the expected input and the rendered output of the text. The recognized resource types are:

  • Open Bible Stories - obs
  • Scripture/Bible - bible
  • translationNotes - tn
  • translationWords - tw
  • translationQuestions - tq
  • translationAcademy - ta

Available Conversion Options

Conversion modules specify a list of options that they accept to help format the output document. Every conversion module MUST support these options:

  • "language": "en" - Defaults to en if not provided; MUST be a valid IETF language tag; may affect the font used
  • "css": "http://some.url/your_custom_css" - A CSS file that you provide. You can override or extend any of the CSS in the templates with your own values.

Conversion modules MAY support these options:

  • "columns": [1, 2, 3, 4] - Not available for obs input
  • "page_size": ["A4", "A5", "Letter", "Statement"] - Not available for HTML output
  • "line_spacing": "100%"
  • "toc_levels": [1, 2, 3, 4, ...] - To specify how many heading levels you want to appear in your TOC.
  • "page_margins": { "top": ".5in","right": ".5in","bottom": ".5in","left": ".5in" } - If you want to override the default page margins for PDF or DOCX output.
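
As an illustration of how a conversion module might handle these options, the sketch below applies the documented default for language and checks the enumerated values. The function name normalize_options is hypothetical, and rejecting unsupported options (rather than silently ignoring them) is an assumption, since this document does not say which behaviour is expected.

```python
ALWAYS_SUPPORTED = {"language", "css"}
OPTIONAL = {"columns", "page_size", "line_spacing", "toc_levels", "page_margins"}
PAGE_SIZES = {"A4", "A5", "Letter", "Statement"}


def normalize_options(options, supported_optional=frozenset()):
    """Apply defaults and reject options this module does not support."""
    allowed = ALWAYS_SUPPORTED | (OPTIONAL & set(supported_optional))
    unknown = set(options) - allowed
    if unknown:
        # ASSUMPTION: unsupported options are an error rather than ignored.
        raise ValueError("Unsupported options: " + ", ".join(sorted(unknown)))

    normalized = dict(options)
    normalized.setdefault("language", "en")  # documented default
    if "page_size" in normalized and normalized["page_size"] not in PAGE_SIZES:
        raise ValueError("Unknown page_size: %r" % (normalized["page_size"],))
    if "columns" in normalized and normalized["columns"] not in (1, 2, 3, 4):
        raise ValueError("columns must be 1, 2, 3, or 4")
    return normalized
```

A module that supports only line_spacing beyond the mandatory options would call normalize_options(request_options, {"line_spacing"}).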

Deploying Modules

Each module is initially deployed to AWS Lambda via the apex command. After this, Travis CI is configured to manage continuous deployment of the module (see docs at https://docs.travis-ci.com/user/deployment/lambda).

Continuous deployment of the module should be set up such that:

  • the master branch is deployed to production whenever it is updated
  • the develop branch is deployed to staging whenever it is updated

The deployment process looks like this:

  • Code in progress lives in a feature-named branch until the developer is happy and automated tests pass.
  • Code is peer-reviewed, then
  • Merged into develop, where it stays until automated testing passes and it integrates correctly in staging.
  • Merged into master which triggers the auto-deployment

Registering a Module

Every module (except tx-manager) MUST register itself with tx-manager. A module MUST provide the following information to tx-manager:

  • Public endpoints (for tx-manager to present)
  • Private endpoints (will not be published by tx-manager)
  • Module type (one of conversion, authorization, utility)

A conversion module MUST also provide:

  • Input format types accepted
  • Output format types produced
  • Resource types accepted
  • Conversion options accepted

Example registration for md2pdf:

Request

POST https://api.door43.org/tx/module

{
    "name": "tx-md2pdf_convert",
    "version": "1",
    "type": "conversion",
    "resource_types": [ "obs", "bible" ],
    "input_format": [ "md" ],
    "output_format": [ "pdf" ],
    "options": [ "language", "css", "line_spacing" ],
    "private_links": [ ],
    "public_links": [
        {
            "href": "/md2pdf",
            "rel": "list",
            "method": "GET"
        },
        {
            "href": "/md2pdf",
            "rel": "create",
            "method": "POST"
        }
    ]
}

Response:

201 Created

{
    "name": "md2pdf",
    "version": "1",
    "type": "conversion",
    "resource_types": [ "obs", "bible" ],
    "input_format": [ "md" ],
    "output_format": [ "pdf" ],
    "options": [ "language", "css", "line_spacing" ],
    "private_links": [ ],
    "public_links": [
        {
            "href": "/md2pdf",
            "rel": "list",
            "method": "GET"
        },
        {
            "href": "/md2pdf",
            "rel": "create",
            "method": "POST"
        }
    ]
}
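
Before a module sends its registration, it can be useful to check that every field listed as a MUST above is present. The following is a sketch of such a check, not part of tx-manager; the function name missing_registration_fields is hypothetical, and the field names are taken from the md2pdf example above.

```python
import json

COMMON_FIELDS = ("type", "public_links", "private_links")
CONVERSION_FIELDS = ("input_format", "output_format", "resource_types", "options")


def missing_registration_fields(registration):
    """Return the names of required fields absent from a registration dict."""
    required = list(COMMON_FIELDS)
    if registration.get("type") == "conversion":
        # Conversion modules must also describe what they convert.
        required.extend(CONVERSION_FIELDS)
    return [field for field in required if field not in registration]


registration = {
    "name": "tx-md2pdf_convert",
    "version": "1",
    "type": "conversion",
    "resource_types": ["obs", "bible"],
    "input_format": ["md"],
    "output_format": ["pdf"],
    "options": ["language", "css", "line_spacing"],
    "private_links": [],
    "public_links": [
        {"href": "/md2pdf", "rel": "list", "method": "GET"},
        {"href": "/md2pdf", "rel": "create", "method": "POST"},
    ],
}

# json.dumps always emits valid JSON, which avoids hand-written syntax slips
# such as trailing commas in the request body.
body = json.dumps(registration)
```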

tX Webhook Client

The tX Webhook Client is a client to tX. The purpose of this client is to pre-process the git repos from Gogs' webhook notifications, send them through tX, and upload the resulting HTML files to the door43.org site. The process looks like this:

  • Accepts the default webhook notification from git.door43.org
  • Gets the data from the repository (via HTTPS request)
  • Identifies the Resource Type (via name of repo or manifest.json file)
  • Formats the request (turns the repo into valid Markdown or USFM, then creates a zip file)
  • Sends the valid data (in zip format) through tX, requesting HTML output
  • Gets the resulting HTML file and uploads it to the door43.org S3 bucket
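
The resource type identification step might be sketched as follows. Everything specific here is an assumption made for illustration: this document does not define the manifest.json schema or the repository naming convention, so the "resource_type" manifest key and the "-<type>" or "_<type>" repo name suffix are hypothetical, as is the function name identify_resource_type.

```python
RESOURCE_TYPES = ("obs", "bible", "tn", "tw", "tq", "ta")


def identify_resource_type(repo_name, manifest=None):
    """Guess the resource type from a manifest dict, then from the repo name.

    ASSUMPTIONS (not defined by this document): the manifest may carry a
    "resource_type" key, and repo names end in "-<type>" or "_<type>".
    """
    if manifest:
        declared = manifest.get("resource_type")
        if declared in RESOURCE_TYPES:
            return declared
    # Fall back to the last dash- or underscore-separated part of the name.
    suffix = repo_name.lower().replace("_", "-").rsplit("-", 1)[-1]
    return suffix if suffix in RESOURCE_TYPES else None
```

Returning None when nothing matches lets the caller decide whether to reject the webhook or fall back to a default.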

Including Python Packages in a Lambda Function

The packages a Python function depends on must reside in the directory of the function that imports them. For example, a dependency of the convert function must exist within functions/convert/.

The list of requirements for a function should be in a requirements.txt file within that function's directory, for example: functions/convert/requirements.txt.

Requirements must be installed before deploying to Lambda. For example:

pip install -r functions/convert/requirements.txt -t functions/convert/

The -t option tells pip to install the files into the specified target directory. This ensures that the Lambda environment has direct access to the dependency.

If you have any Python files in subdirectories that also have dependencies, you can import the ones available in the main function by using sys.path.append('/var/task/').
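
For example, a file in a subdirectory could start like this. /var/task/ is the directory into which Lambda unpacks the function's deployment package; the guard against adding the path twice is an optional nicety rather than a requirement.

```python
import sys

# Directory where Lambda unpacks the function's code and its
# pip-installed dependencies.
LAMBDA_TASK_ROOT = '/var/task/'

# Make packages installed next to the main function importable from here.
if LAMBDA_TASK_ROOT not in sys.path:
    sys.path.append(LAMBDA_TASK_ROOT)
```

Imports of the shared dependencies should come after these lines so that the path is already in place when they are resolved.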

Lastly, if you install dependencies for a function, you need to include the following in a .apexignore file:

*.dist-info

For Reference

There is a similar API that has good documentation at https://developers.zamzar.com/docs. This can be consulted if we run into blockers or need examples of how to implement tX.