
Todo - AZT

A-Z+T 1.0 release stoppers

This is the A-Z+T feature set required for largely independent use, from the start through the alphabet chart to tone data collection.

  • Output alphabet chart with
    • one default set of characters to display (all consonant and vowel groups, which show up anywhere)
    • one default alphabet order (from Unicode/ANSI/ASCII?)
    • one default layout (maybe calculated from 8.5×11-ish paper, so it depends on how many letters to show)
    • a means to select words (pictures having already been selected)
  • Decide how to address sorting and reporting with obligatory morphology, implement and document for users.
  • French translation

A-Z+T 1.0 Bugs

  • Figure out why CS isn't working (not in settings?)
  • Segment interpretation settings saving and operating as expected!!
  • under the hood
    • self.verifybutton
    • removesenseidfromgroup
  • user interface
    • Consider removing group numbers from UI. Call it 'this group'?
  • Segment reports by sort groups (not current form)
  • figure out sorting by root position (even if the full implementation waits for v1.1)
  • figure out filtering by noun/verb class (even if the full implementation waits for v1.1)
  • decide what to do with ad hoc groups, either to maintain (and update), or drop altogether
  • fix 'add words' crashing when there are no words
  • fix the lack of an error on new language creation when the directory already exists

Version 1.1+

Tone Frame exemplification

  • put real lexical examples, in place of <x lang word>
  • put 'change word' button on the top.
    • will need to work in the case of asking for gloss languages not in the lexicon
      • just list in this case?
      • unselect least populous gloss? I don't like this
      • list upfront the number of glosses in each lang: do this in any case
        • maybe also the number of entries with this, that, and the other glosses.
  • finish each line with a colon (?), followed by example ?as currently formatted
    • Not sure how well this would work

New Reports

  • pssubclass report
  • alphabet report/chart

Data collection

  • consider collecting recordings first, then first transcription.

Papercuts

  • Picking up images for demo dbs doesn't seem to be working consistently

  • Read and write LIFT header attributes (creator, date?)

  • Don't die at startup if git user info isn't specified

  • New project needs to run git init, or else not die

  • .gitignore must be complete before the first commit, or it will take in unwanted files!

  • keep taskchooser from showing up on reports

  • test sound function thoroughly

  • Make change gloss task

  • make join/rename ps task

  • Handle capitalization well (where users input words with capitals)

  • Should I pull NC from digraphs, and rely on NC=C setting?

  • Look at using 'aa' as a digraph vs. using the VV=V setting: do they both give the same V groups? Which one is better in terms of syllable profiles? (They should be the same...)

  • Consider putting AZT repo URLs in the AZT folder, not the language data settings file (unlike the same settings for language data)

  • 'Please wait' has black bars on the left and bottom on Windows

  • Solicit languages (from users or the general populace) that need other scripts

  • Incorporate GSRfL stylesheets into AZT

    • Catch up XLP stylesheets to GSRfL
  • Try out different fonts and font variations in fonts.py, to see which ones work. See if we can run A-Z+T in just those fonts, and not have to switch between Charis and Andika.

Document background preparation to do

  • Look at parts of speech, and decide what are likely good frames
  • Either based on the family, related languages, or on what is known of the language itself.
  • Think through recording needs and equipment, including environment and training required.

Hardware

  • Can we facilitate the purchase of a lot for people who can front the money?
    • Pi with keyboard: check out current options for CPU, RAM
    • Projector: USB-C, ?with battery backup
    • Battery: large capacity, USB-C
    • <$200
    • Mouse

Transcriber

  • pause between syllables
  • There's a difference between words, makes shakes now pronounced
  • inappropriate tweak between adjacent letters of the same value (should be even tone)

Project organization and status

A table somewhere to report status at a higher level

  • ps v profile
  • some 'done' value for each of C, V, and tone
    • Sort: number of checks with
      • no tosort
      • all verified groups
      • no tojoin
    • Name: number of checks with
      • no integer groups
    • Record: number of checks with
      • recorded True?
  • if group names are listed, they should be in the same order
  • allow clicking to go there (could replace the task chooser; the 'done' roll-up is sketched below)
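
The roll-up described in the list above could be as simple as the following sketch. The check-dict shape (keys like tosort, tojoin, groups, recorded) is purely hypothetical and not A-Z+T's actual internal structure.

```python
# Purely illustrative roll-up of per-check status into 'done' counts.
# The check dict keys (tosort, tojoin, groups, recorded) are hypothetical.
def summarize(checks):
    """checks: iterable of dicts, one per ps/profile check."""
    checks = list(checks)
    return {
        "sort done": sum(1 for c in checks
                         if not c.get("tosort") and not c.get("tojoin")),
        "name done": sum(1 for c in checks
                         if not any(str(g).isdigit() for g in c.get("groups", []))),
        "recorded": sum(1 for c in checks if c.get("recorded") is True),
    }
```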

Consider a LIFT merge function in lift.py

  • Take a three-way merge between LIFT files, maybe with the origin.
  • For each guid, check everything not in a sense; for each sense, check its internals, except for example identifiers.
  • Wherever something has been added to one, add it to the outcome. Do the easy stuff first, then limit to the others (see the sketch below).
  • The stuff I was looking at today should be done automatically.
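
A minimal sketch of the "easy stuff" (whole entries added on either side relative to the origin), assuming stock LIFT structure with guid attributes on entry elements. Sense-level merging is not attempted here, and the file names are placeholders.

```python
# Sketch only: copy entries added in either branch (vs. origin) into the outcome.
# Assumes <entry guid="..."> elements directly under the LIFT root.
import xml.etree.ElementTree as ET

def entries_by_guid(tree):
    """Index a LIFT tree's <entry> elements by guid."""
    return {e.get("guid"): e for e in tree.getroot().findall("entry")}

def merge_added_entries(origin_file, ours_file, theirs_file, out_file):
    origin_guids = set(entries_by_guid(ET.parse(origin_file)))
    merged = ET.parse(ours_file)            # start the outcome from "ours"
    merged_guids = set(entries_by_guid(merged))

    # Anything in "theirs" that is in neither origin nor "ours" was added there.
    for guid, entry in entries_by_guid(ET.parse(theirs_file)).items():
        if guid not in origin_guids and guid not in merged_guids:
            merged.getroot().append(entry)

    merged.write(out_file, encoding="utf-8", xml_declaration=True)
```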

Alphabet chart example selection

  • Start with pictured words, then with S1=S2, etc.
  • Give a scrollbar of buttons, one for each letter
  • Letter buttons organized like group buttons, to scroll through them until you like one
  • Ultimately output to JSON or whatever the alphabet program uses, and also to a file for loading later to keep working, and for posterity (see the sketch below)
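
A minimal sketch of the "output to a file for loading later" idea. The JSON shape (letter mapped to a chosen word and picture) is an assumption, not the format of any particular alphabet program.

```python
# Sketch: save/reload alphabet chart selections; the structure is an assumption.
import json

def save_chart_selections(selections, filename="alphabetchart.json"):
    """selections: e.g. {'b': {'word': 'bala', 'image': 'bala.png'}}"""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(selections, f, ensure_ascii=False, indent=2)

def load_chart_selections(filename="alphabetchart.json"):
    with open(filename, encoding="utf-8") as f:
        return json.load(f)
```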

AZT 2.0 step 1

  • On clicking "sort", put up a brief set of instructions with a QR code, instructing everyone to go there on their phones.
  • Sorting then proceeds according to 2.0 rules until everyone is happy; then move on.
  • At first, should probably click for each slice of sorting
  • Could show on an instruction wall what is being sorted on, in case anyone gets lost or forgets the basic rules

Starting a database in a new related language

  • Add second language
  • Keep first language as a second analysis language, or as a gloss? (Have both as options?)
  • Once second analysis is added, can we just switch and continue?
  • Think about outputs, including comparison tables
  • Long term, do I want this as an option (to analyze multiple languages in one database)
  • Or should I rather work on tools to help comparison?
  • Dialect analysis considerations
    • Glosslang frames could not change, either in the frame or in a given example (without causing problems for the other analang, as there is just one value per glosslang)
    • so if name and glossing are not the same, make a different frame
    • This would only apply in a multilingual dictionary when multiple languages have forms in an entry, sharing one or more senses. In this case, glossing should be the same, though all the form fields (lx,lc,pl,imp) could have different values by language
      • if two languages do not share an entry, they won't share any senses (and therefore won't share any examples), and can have independent glossing.
    • Not sure if multiple senses just for sense variation between dialects is a good idea.
      • I don't know that there's a way to show which sense goes with which language.
        • could have one sense with forms in one language, and another sense with forms in the other.
          • this would require robust logic to not die on a sense missing a language form.
      • ?So if you need different definitions or glossing, you probably want different entries; bummer for comparative work.
    • Also, if we elicit one language through another, at least some of those entries will require (sooner or later) at least tweaks to the glossing/defns. So whatever UI does this, we need a way to split an entry by language when doing so, to preserve the original glossing for the other language.
      • 'which language do you want to change the gloss for?'
    • Set up frames for dialect analysis
      • We can use the same tone frames if they work in each dialect, or
        • A frame would have multiple analang forms, which would simply be ignored when not requested (as either analang or glosslang for the other)
      • simply define other frames; some would have some languages, and others would have others.
      • This would mean that frames would continue not to be coded for language, other than storing each frame form coded by analang code.
      • To do this, we would need a 'modify frame' page to add new analang data (glosses should stay unmodifiable)
        • This same page could change the frame name
    • I still need to think through how to do multiple analangs in reports

Installation issues

  • Why is numpy not installing for win_amd64?
  • Copy modules to AZT modules dir for windows users
  • Figure out how to add icon to windows shortcut
    • may be a naming character problem

Sort Orders

  • How to tell the computer, in the UI, which characters should precede others (see the sketch below)

    • Showing in order
    • Where to store that
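
One way to do this, sketched below: store the order as a plain list and sort with a key that ranks each character by its position in that list. The example alphabet and words are invented, and multigraphs would need tokenizing before ranking.

```python
# Sketch: sort words by a stored, user-defined alphabet order.
# Example order and words are invented; unknown characters sort last,
# and multigraphs (e.g. 'gb') would need tokenizing before ranking.
ALPHABET = ["a", "b", "ɓ", "d", "ɗ", "e", "ɛ", "f", "g"]

def sort_key(word, order=ALPHABET):
    rank = {c: i for i, c in enumerate(order)}
    return [rank.get(ch, len(order)) for ch in word]

print(sorted(["ɗebe", "baga", "ɛfa"], key=sort_key))
# ['baga', 'ɗebe', 'ɛfa']
```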

    Frame Editor

    • Change frame name?
    • Would need to iterate across lift, for all profiles
    • Do I want to allow modification of frame content?
    • Copy of existing frame? Error on same name....
    • Sourced by window listing frame names and forms
    • Name change in toneframe, status, and lift
    • N.B.: this will be powerful and to be done with care, as it will change the location field for entries in your database.
    • Maybe add a check against doing this when it would damage something?
    • Lots of frames; is 24 too many?
      • How to investigate these questions without multiplying so many frames?
      • How to show only relevant frames
    • Header buttons should hide the column or row, coming back with a "show all" button in the corner

Mostly done stuff

Remaining parse fn

  • Once the second form is given, if the parse is rejected, offer another parse with a shorter root.
    • preserve set of root hypotheses,
      • maybe order them, and
      • ask about them in order
      • or just exclude what was rejected, and evaluate for the best again.

UI issues

  • Parse page needs to ask about nouns and verbs at the same time
    • Have a page with two scrolling columns, for noun options and verb options.
      • For each, list options to select, built from known prefixes and suffixes for citation and plural or imperative forms (added and removed as appropriate).
    • Be ready for a word that fits both (two senses with zero derivation)
    • pl could be a stand-in for whatever secondary form shows nounhood, and imp for whatever secondary form shows verbhood
      • should document this usefully somehow
  • offer list of options to select
    • User selects the correct second form, and AZT does all the calculation behind the scenes, including populating lc, pl, and/or imp, modifying/creating lx, and setting the part of speech, then presenting the next word.
  • Will need to have an "other" button.
    • AZT also has a "none of these" button under each column, which brings up another window to just type in the secondary form.
    • NO:activates automatically if nothing parses off of citation form
    • gives a fill in option for second form (and ps?)
    • will need a repair strategy if a bad morpheme set gets saved, or a bad ps
      • maybe load each time, so values not present get lost (once fixed)
      • fn to rebuild affix database (no UI for this yet)
    • This second page will do the same manipulation of the LIFT file, but all this info also gets stored in the known-affixes file, for use in presenting options to the user for future words.
  • Do ALL parsing behind the scenes
  • If parsing after giving second form, don't offer selection of second form
  • Don't do this:
    • What letters don't change between the plural and singular forms? Buttons for range(1, len(x)) letters missing from the front and back, then all combos, right? Scroll this list, as there will be many options. When the user selects one, a second frame opens with that root surrounded by prefix and plural boxes, asking the user what other letters are needed to build the plural/imperative form. Also a toggle for which form it is we're building.

New settings (for morphology, sensitive to subcategories)

  • Store pairs of affixes from object

Functional Structure

  • Word Collection (lc)
  • Record (lc)
  • ToneFrames (lc)
  • Syllable Profile Analysis (lc)
    • SortV (lc)
    • SortC (lc)
    • SortT (lc)
      • RecordT (lc)
      • TranscribeT (lc)
  • Parsing functions
    • Either of
      • ParseB (if clickable second form)
      • Collect 2nd forms (pl/imp)
        • ParseA (on the basis of typed second form)
    • produces these outputs:
      • output: lx analysis
      • output: second form (clicked or typed)
      • output: part of speech
    • allows these functions:
      • Record second forms
      • Syllable Profile Analysis (lx/pl/imp)
        • SortV (lx/pl/imp)
        • SortC (lx/pl/imp)
        • ToneFrames (lx/pl/imp)
          • SortT (lx/pl/imp)
            • RecordT (lx/pl/imp)
            • TranscribeT (lx/pl/imp)

Parsing objects/classes

Sort&Segments

  • Parsing needs to happen before CV sorting, but the forms need to be kept in sync during sorting, or else the parsing will need to be redone
    • CV parse: replace segments in multiple forms; use rx.split(), re.sub(), etc.
    • CV: Look at string replacement methods, see if any work for number of occurrences
    • Use new and old values for replacement
      • use replace to take changes from one and put to the other?
      • Keep lx up to date with lc, pl, and imp
      • Try splitting the form on the old value and joining with the new value as the delimiter (new.join(t.split(sep=old)) works; see the sketch after this list)
  • Do something with stem type field
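
A minimal sketch of keeping the form fields in sync using the new.join(t.split(sep=old)) idiom noted above. The flat entry dict is a stand-in for however the LIFT entry is actually accessed.

```python
# Sketch: propagate a segment change across parallel form fields, using
# the new.join(t.split(sep=old)) idiom.  The entry dict is a stand-in.
def resegment(entry, old, new, fields=("lc", "pl", "imp", "lx")):
    """Replace every occurrence of `old` with `new` in each form field."""
    for field in fields:
        t = entry.get(field)
        if t:
            entry[field] = new.join(t.split(sep=old))
    return entry

entry = {"lc": "libala", "pl": "dibala", "lx": "bala"}
print(resegment(entry, old="b", new="ɓ"))
# {'lc': 'liɓala', 'pl': 'diɓala', 'lx': 'ɓala'}
```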

Affix collector (do often and cheaply)

  • find common and uncommon segments of lx/lc/pl/imp
    • on boot
    • on parse
  • Assume no reduplication
    • maybe offer root twice in succession, if fails at first?
  • collect affixes only from best data:
    • Ideal (level 4): lx is a subset of both the first and second forms
      • form.split(sep=root) (assuming a good parse) gives known affixes for each (see the sketch after this list)
      • known affixes are already paired for that (correct) ps
    • Set the parser auto level above the default of 4 to parse these manually (e.g., if all the information is complete and consistent, but nonetheless incorrect).
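
A minimal sketch of harvesting one (lc, second-form) affix pair from an "ideal" entry using form.split(sep=root). The field names and tuple shapes follow the notes above, but the entry dict itself is an assumption.

```python
# Sketch: harvest a (prefix, suffix) pair per form from one ideal entry,
# where lx appears exactly once in both lc and the second form.
def affixes_for(form, root):
    pieces = form.split(sep=root)
    if len(pieces) != 2:        # root absent, or occurs more than once
        return None
    return tuple(pieces)        # (prefix, suffix)

def collect_pair(entry):
    """Return ((lc_pfx, lc_sfx), (2nd_pfx, 2nd_sfx)) for an ideal entry, else None."""
    root = entry["lx"]
    second = entry.get("pl") or entry.get("imp")
    lc_affixes = affixes_for(entry["lc"], root)
    second_affixes = affixes_for(second, root) if second else None
    if lc_affixes and second_affixes:
        return (lc_affixes, second_affixes)
    return None

print(collect_pair({"lx": "bala", "lc": "libala", "pl": "dibala"}))
# (('li', ''), ('di', ''))
```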

Parser (different method of same class as above)

  • Ideal-1: but missing or other ps
  • Ideal-2: unpaired known affixes
  • Ideal-3: only one known affix
  • ideal-4: no known affixes, or no lcs (including missing second form to compare)
    • Don't do anything with this except by explicit user request; rather, suggest possibly correct forms for the user to pick from.
  • Only two forms:
    • No lc (i.e., just lx and second):
      • assume lx is a pronounceable lc, move it over, and do 'no lx' below.
    • No lx (i.e., parsing not already done):
      • Ideal-: lx missing, but longest common string for 1st-2nd forms
        • is >50% of forms
        • can we come up with a more sane test for parse sanity?
        • should we calculate affixes, but also have them stored in settings?
    • No second form (not collected, one way or another):
      • parse between lx and lc only
      • suggest 2nd form possibilities w/o pairings
      • or we could skip this
        • and/or mark for parsing later

Affix storage catalog

  • By ps (nouns v verbs)
    • Make subcategories used in the ps profile logic, including an 'all' option
      • start hypersplit, get joined later (from ps-profile)
    • profile not needed, at least at first (maybe V init/final would provide different options than C init/final, but we would need to be careful until we know that was working)
    • list found tuples of tuples:
      • each noun combination is (lc, pl)
      • each verb combination is (lc, imp)
      • tuple for lc/pl/imp is (prefix, suffix)
        • [0] is pfx, [1] is suffix:
          • if len()>2, error msg
          • output this tuple to a tuple as above to preserve correspondences
          • should expect x:y relationship between lc affixes and second form affixes (so a simple list or dict wouldn't capture it)
      • e.g., one noun entry might result in (('li',),('di',))
      • from this list could derive list of affix tuples for
        • any first form [i[0] for i in list]
        • any second form [i[1] for i in list]
        • all three lists could be compiled with collections.Counter([iterable-or-mapping]) (see the sketch after this list)
          • most popular would be .most_common(1)
      • This means we don't have to track correspondence between lc affixes and pl/imp affixes in the settings for users to fix/screw up
        • track in LIFT pssubclass
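
A minimal sketch of that catalogue and the Counter summaries mentioned above; the stored pairs are invented example data.

```python
# Sketch: catalogue of ((lc_pfx, lc_sfx), (2nd_pfx, 2nd_sfx)) pairs per ps,
# summarized with collections.Counter; the data below is invented.
from collections import Counter

noun_pairs = [
    (("li", ""), ("di", "")),
    (("li", ""), ("di", "")),
    (("mu", ""), ("ba", "")),
]

lc_affixes = Counter(pair[0] for pair in noun_pairs)      # any first form
second_affixes = Counter(pair[1] for pair in noun_pairs)  # any second form
pairings = Counter(noun_pairs)                            # keeps correspondences

print(lc_affixes.most_common(1))   # [(('li', ''), 2)]
print(pairings.most_common(1))     # [((('li', ''), ('di', '')), 2)]
```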

Parser: access and store affix values in each LIFT sense

  • Each sense will already be marked for ps, read that
  • NO:Field[@type=' affixes']/form[@lang=analang]
  • in 'trait[@name="{}-infl-class"]'.format(self.psvalue())
  • Values are marked for placement by -x/x-, so the parser can pick them up and use them correctly in (pfx, sfx) tuples.
  • Value will be picked up by affixes object
  • Value stored by parser object (via affix object method?)
    • in LIFT
    • in affix catalog

Method to draft root

  • Import method first (build confirmed affixes)
  • Known affix method
    • This should cover use cases where lidata and didata would otherwise parse off just l-/d-, since we would start with known affixes, which could be set just once.
  • lcs method last (because it requires user confirmation, which will be required for all new affixes)
    • find the largest overlaps between two forms (difflib.SequenceMatcher(None, t, t2).find_longest_match(); see the sketch after this list)
      • This will collect common affix segments to the root
      • subset this to find all (analyzable) root possibilities
    • try to build actual forms with known affixes on these root hypotheses
  • Evaluate roots (not sure about this)
    • Filter prefixes by form.startswith(afx), suffixes w endswith(afx)
    • No:afx can be string or tuple of strings (maybe test that first?)
    • With two forms, do each, and if any from the one matches one from the other, assume that's your root. Mark how many, and track quality of match:
      • each form built on known affixes
      • known affixes for each form already present together
      • ps match on second form
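
A minimal sketch of the lcs step above, using difflib and the >50% sanity test. Note that shared prefix material ends up in the root guess, which is the "segments common to two affixes" problem noted further down; the example forms are invented.

```python
# Sketch: draft a root as the longest common substring of two forms,
# then apply the crude >50% sanity test.  Example forms are invented.
from difflib import SequenceMatcher

def draft_root(t, t2):
    m = SequenceMatcher(None, t, t2).find_longest_match()   # Python 3.9+
    return t[m.a:m.a + m.size]

def plausible(root, *forms, min_share=0.5):
    return all(len(root) / len(f) > min_share for f in forms)

root = draft_root("libala", "dibala")
print(root, plausible(root, "libala", "dibala"))
# ibala True (the shared 'i' of li-/di- gets pulled into the root)
```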

Parser: parse from two forms

  • This will not change the data fields. It only populates/changes the analysis in the lx field and the affix fields, so mistakes here should always be recoverable.
  • Need to parse lx from lc and (pl or imp).
  • Populate affix fields (by calling or running before affix collector)
  • if lx (or lc) and second form, but no affix defns:
    • If no lc, move lx to lc. This assumes lx was data, not analysis
    • draft root

Procedure for one form only:

  • load affix collector
  • Make list of root hypotheses: form less affixes for that form (lc or second)
    • process with affix tuple correspondences, to put on the second form
    • ?:I don't expect to have second forms where no lc exists, but should probably plan for this to happen eventually
  • For each (assuming the list is nonempty)
    • For each other form (from above)
      • suggest the root hypothesis plus the other form's affixes
  • present a "None of the Above" button
  • Take user input, and store it.
  • lx should be a subset of lc, and maybe pl and imp

Method to build forms (check draft root against (two) forms)

  • Iterate over each lc known affix
    • Check if root + affix == lc
  • Iterate over each known pl/imp affix
    • Check if root + affix == form
  • If the drafted root builds lc & 2nd with known affixes: 100%, +2 certainty (see the sketch after this list)
  • evaluate and act on lesser certainty
    • new affixes
    • segments common to two affixes: this is hard to show without a third form, or at least knowledge of the language family
      • this should already be dealt with?
      • If two different roots build forms on known affixes, the smaller root is taken.
        • prioritize known larger affixes over larger roots.
    • suppletive root
    • At some point we will need a user (who?) to confirm parsing, wherever a certainty threshold isn't reached:
      • a new affix is found
      • once second form is there, present parsing for confirmation (according to settings for confirmation and auto-parsing)
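
A minimal sketch of the check above: rebuild each form from the drafted root with known (prefix, suffix) pairs and score the result. The affix lists and the two-point score are assumptions.

```python
# Sketch: check a drafted root against the two forms with known
# (prefix, suffix) pairs; the affix lists and scoring are assumptions.
def builds(form, root, affix_pairs):
    """Return the (prefix, suffix) pair that rebuilds `form` from `root`, if any."""
    for prefix, suffix in affix_pairs:
        if form == prefix + root + suffix:
            return (prefix, suffix)
    return None

def certainty(root, lc, second, lc_affixes, second_affixes):
    score = 0
    if builds(lc, root, lc_affixes):
        score += 1
    if builds(second, root, second_affixes):
        score += 1
    return score    # 2 = both forms rebuilt from known affixes

lc_affixes = [("li", ""), ("mu", "")]
pl_affixes = [("di", ""), ("ba", "")]
print(certainty("bala", "libala", "dibala", lc_affixes, pl_affixes))  # 2
```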

~~## Parse I (should be part of parse UI – one option)

  • copy lc > pl/imp
  • this assumes NO obligatory morphology
  • this allows NO check for ps~~

~~## Parse II (Questionable value)

  • select 2 cuts
  • parse lc with gui buttons
  • maybe suggest pl/imp affixes on each?~~

Parse III

  • only works where lc has known affixes
  • GUI still requires "other" button
  • Access stored pairs of affixes
  • method to see which are possibly present in lc
  • construct possible alternate (pl/imp) forms for presentation to the user (see the sketch below)
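
A minimal sketch of Parse III: strip known lc affixes off the citation form and build candidate second forms from the paired affixes, for the user to pick from. The stored pairs are invented example data.

```python
# Sketch: build candidate pl/imp forms from stored (lc, second) affix pairs.
# The pairs below are invented example data.
noun_pairs = [(("li", ""), ("di", "")), (("mu", ""), ("ba", ""))]

def candidate_second_forms(lc, pairs):
    candidates = []
    for (pfx, sfx), (pfx2, sfx2) in pairs:
        if lc.startswith(pfx) and lc.endswith(sfx):
            root = lc[len(pfx):]
            if sfx:
                root = root[:-len(sfx)]
            candidates.append(pfx2 + root + sfx2)
    return candidates

print(candidate_second_forms("libala", noun_pairs))   # ['dibala']
```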

A-Z+T 1.1

General efficiency issues

  • Remove all build fns, replace with try blocks

Make lift class more OO

  • Method to return senses, etc as objects, use those internally and elsewhere
  • Find and modify object nodes, rather than getting and putting text
    • How to manage the question of when you make the empty node?
      • if the logic assumes a node object to write to, a missing object could break it.
      • create object on populate (extra try block)
      • If it doesn't get filled with info, are we just creating lots of empty nodes?
      • we could maybe have an attribute to mark modified, remove those without?
        • when would this be? not on all lift writes, as they are frequent...
  • How to superclass XML nodes?
  • Use classes for entry, sense, example, etc. (see the sketch after this list)
  • Distinguish examples by language form@lang/text
    • (Location and?) tone form fields should be coded by analang, so multiple languages can store distinguishable data in the same example (is this a good idea?)
    • use annotationlang
    • Think through whether we want a frame and/or example with a different name for each language... Probably not
    • Forms to search should be lexemes, so we get root C and V positioning
      • We really should optimize this better, if it is to be run often...
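
A minimal sketch of wrapping LIFT nodes in thin entry/sense classes, assuming stock LIFT element names (entry, sense, gloss); this is not A-Z+T's actual lift.py API.

```python
# Sketch: thin object wrappers over LIFT XML nodes, assuming stock LIFT
# element names; not A-Z+T's actual lift.py API.
import xml.etree.ElementTree as ET

class Sense:
    def __init__(self, node):
        self.node = node                      # underlying <sense> element

    def gloss(self, lang):
        el = self.node.find(f"gloss[@lang='{lang}']/text")
        return el.text if el is not None else None

class Entry:
    def __init__(self, node):
        self.node = node                      # underlying <entry> element

    @property
    def guid(self):
        return self.node.get("guid")

    def senses(self):
        return [Sense(s) for s in self.node.findall("sense")]

class Lift:
    def __init__(self, filename):
        self.tree = ET.parse(filename)

    def entries(self):
        return [Entry(e) for e in self.tree.getroot().findall("entry")]
```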

How much of the formatting object is really necessary?

  • What could be replaced by methods? ALL
  • Methods could ask or set, as with other methods
  • Would methods be more or less efficient? MUCH MORE

What other modules should be able to be run independently?
