
Case: fstalign ignores symbols table and does not find alignment in a simple transcript #40

Open
niedakh opened this issue Feb 15, 2023 · 4 comments

Comments


niedakh commented Feb 15, 2023

Hi,

fstalign 1.6.1 does not load FST symbol tables properly and modifies the hypothesis FST so that it is completely broken:

Here is the output of the command /fstalign/bin/fstalign wer --ref /data/customer/ref.txt --hyp /data/customer/hyp.nlp --symbols /data/customer/hyp.sym --output-sbs /data/customer/res.sbs --log /data/customer/res.log in the current Docker image. It happens both with a txt file (one gold transcript word per line) and with a CTM file containing the time-aligned gold transcript.

[2023-02-16 10:51:17.854] [console] [info] loggers initialized
[+++] [10:51:17] [console] fstalign version is 1.6.1
[+++] [10:51:17] [console] reading reference plain text from /data/customer/ref.txt
[+++] [10:51:17] [console] reading hypothesis fst from /data/customer/hyp.fst
[+++] [10:51:17] [fstalign] starting conversion to int vector
[+++] [10:51:17] [fstalign] converting ref to int vector
[+++] [10:51:17] [OneBestFstLoader] creating std::vector<int> for OneBestFstLoader for 27 tokens
[+++] [10:51:17] [fstalign] converting hyp to int vector
[+++] [10:51:17] [FstFileLoader] convertToIntVector isn't implemented for FST inputs
[+++] [10:51:17] [fstalign] Either ref or hyp is really small, skipping over the levenstein distance,  ref size: 27, hyp size: 0
[+++] [10:51:17] [FstFileLoader] Total FST has 27 states.
[+++] [10:51:17] [fstalign] generating ref synonyms from symbol table
[+++] [10:51:17] [fstalign] applying ref synonyms on ref fst
[+++] [10:51:17] [SynonymEngine] we have 0 registered first word rules label id
[+++] [10:51:17] [fstalign] printing ref fst
[+++] [10:51:17] [fstalign] 0	1	8/hello	8/hello	0.0
[+++] [10:51:17] [fstalign] 1	2	9/i'm	9/i'm	0.0
[+++] [10:51:17] [fstalign] 2	3	10/fine	10/fine	0.0
[+++] [10:51:17] [fstalign] 3	4	11/suzana	11/suzana	0.0
[+++] [10:51:17] [fstalign] 4	5	12/how	12/how	0.0
[+++] [10:51:17] [fstalign] 5	6	13/are	13/are	0.0
[+++] [10:51:17] [fstalign] 6	7	14/you	14/you	0.0
[+++] [10:51:17] [fstalign] 7	8	15/mhm	15/mhm	0.0
[+++] [10:51:17] [fstalign] 8	9	16/sure	16/sure	0.0
[+++] [10:51:17] [fstalign] 9	10	17/yes	17/yes	0.0
[+++] [10:51:17] [fstalign] 10	11	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 11	12	19/ah	19/ah	0.0
[+++] [10:51:17] [fstalign] 12	13	20/just	20/just	0.0
[+++] [10:51:17] [fstalign] 13	14	21/a	21/a	0.0
[+++] [10:51:17] [fstalign] 14	15	22/couple	22/couple	0.0
[+++] [10:51:17] [fstalign] 15	16	23/of	23/of	0.0
[+++] [10:51:17] [fstalign] 16	17	24/minutes	24/minutes	0.0
[+++] [10:51:17] [fstalign] 17	18	9/i'm	9/i'm	0.0
[+++] [10:51:17] [fstalign] 18	19	25/on	25/on	0.0
[+++] [10:51:17] [fstalign] 19	20	26/my	26/my	0.0
[+++] [10:51:17] [fstalign] 20	21	27/way	27/way	0.0
[+++] [10:51:17] [fstalign] 21	22	28/into	28/into	0.0
[+++] [10:51:17] [fstalign] 22	23	21/a	21/a	0.0
[+++] [10:51:17] [fstalign] 23	24	29/doctor's	29/doctor's	0.0
[+++] [10:51:17] [fstalign] 24	25	30/appointment	30/appointment	0.0
[+++] [10:51:17] [fstalign] 25	26	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 26	27	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 27	28	0/<eps>	0/<eps>	0.0
[+++] [10:51:17] [fstalign] printing hyp fst
[+++] [10:51:17] [fstalign] 0	1	23/of	23/of	0.99158907
[+++] [10:51:17] [fstalign] 0	1	24/minutes	24/minutes	0.008410932
[+++] [10:51:17] [fstalign] 1	2	25/on	25/on	0.8511446
[+++] [10:51:17] [fstalign] 1	2	2/<ins>	2/<ins>	0.05130972
[+++] [10:51:17] [fstalign] 1	2	1/<oov>	1/<oov>	0.047072377
[+++] [10:51:17] [fstalign] 1	2	26/my	26/my	0.0155186895
[+++] [10:51:17] [fstalign] 1	2	27/way	27/way	0.005192776
[+++] [10:51:17] [fstalign] 1	2	28/into	28/into	0.0033090003
[+++] [10:51:17] [fstalign] 1	2	23/of	23/of	0.0015563301
[+++] [10:51:17] [fstalign] 2	3	29/doctor's	29/doctor's	0.3886394
[+++] [10:51:17] [fstalign] 2	3	30/appointment	30/appointment	0.19079825
[+++] [10:51:17] [fstalign] 2	3	31/	31/	0.10623746
[+++] [10:51:17] [fstalign] 2	3	3/<del>	3/<del>	0.04679449
[+++] [10:51:17] [fstalign] 2	3	2/<ins>	2/<ins>	0.034456342
[+++] [10:51:17] [fstalign] 2	3	25/on	25/on	0.0081548225
[+++] [10:51:17] [fstalign] 2	3	32/	32/	0.004374613
[+++] [10:51:17] [fstalign] 2	3	26/my	26/my	0.002221351
[+++] [10:51:17] [fstalign] 2	3	1/<oov>	1/<oov>	0.002075816
[+++] [10:51:17] [fstalign] 2	3	33/	33/	0.0019667444
[+++] [10:51:17] [fstalign] 2	3	34/	34/	0.0005650157
[+++] [10:51:17] [fstalign] 3	4	4/<sub>	4/<sub>	0.69604653
[+++] [10:51:17] [fstalign] 3	4	3/<del>	3/<del>	0.15710989
[+++] [10:51:17] [fstalign] 3	4	30/appointment	30/appointment	0.014679594
[+++] [10:51:17] [fstalign] 3	4	35/	35/	0.011887497
[+++] [10:51:17] [fstalign] 3	4	36/	36/	0.0065641715
[+++] [10:51:17] [fstalign] 3	4	2/<ins>	2/<ins>	0.0056938045
[+++] [10:51:17] [fstalign] 3	4	25/on	25/on	0.0021069725
[+++] [10:51:17] [fstalign] 3	4	26/my	26/my	0.001466121
[+++] [10:51:17] [fstalign] 3	4	29/doctor's	29/doctor's	0.0013138258
[+++] [10:51:17] [fstalign] 3	4	1/<oov>	1/<oov>	0.0012702389
[+++] [10:51:17] [fstalign] 3	4	17/yes	17/yes	0.0009687771
[+++] [10:51:17] [fstalign] 4	5	5/<inaudible>	5/<inaudible>	0.5213346
[+++] [10:51:17] [fstalign] 4	5	37/	37/	0.17481348
[+++] [10:51:17] [fstalign] 4	5	38/	38/	0.14042015
[+++] [10:51:17] [fstalign] 4	5	36/	36/	0.053299483
[+++] [10:51:17] [fstalign] 4	5	3/<del>	3/<del>	0.042188246
[+++] [10:51:17] [fstalign] 4	5	39/	39/	0.011979131
[+++] [10:51:17] [fstalign] 4	5	2/<ins>	2/<ins>	0.0036785
[+++] [10:51:17] [fstalign] 4	5	35/	35/	0.002629472
[+++] [10:51:17] [fstalign] 4	5	30/appointment	30/appointment	0.0022271618
[+++] [10:51:17] [fstalign] 4	5	40/	40/	0.002156982
[+++] [10:51:17] [fstalign] 4	5	41/	41/	0.0016741576
[+++] [10:51:17] [fstalign] 5	6	6/<silence>	6/<silence>	0.6163309
[+++] [10:51:17] [fstalign] 5	6	42/	42/	0.18521181
[+++] [10:51:17] [fstalign] 5	6	36/	36/	0.056179322
[+++] [10:51:17] [fstalign] 5	6	43/	43/	0.038890716
[+++] [10:51:17] [fstalign] 5	6	44/	44/	0.0326784
[+++] [10:51:17] [fstalign] 5	6	3/<del>	3/<del>	0.026641503
[+++] [10:51:17] [fstalign] 5	6	45/	45/	0.020304155
[+++] [10:51:17] [fstalign] 5	6	46/	46/	0.006060596
[+++] [10:51:17] [fstalign] 5	6	5/<inaudible>	5/<inaudible>	0.0041220332
[+++] [10:51:17] [fstalign] 5	6	30/appointment	30/appointment	0.0033203475
[+++] [10:51:17] [fstalign] 5	6	47/	47/	0.0029174143
[+++] [10:51:17] [fstalign] 6	7	48/	48/	0.34014535
[+++] [10:51:17] [fstalign] 6	7	24/minutes	24/minutes	0.2984986
[+++] [10:51:17] [fstalign] 6	7	49/	49/	0.19404508
[+++] [10:51:17] [fstalign] 6	7	10/fine	10/fine	0.016649699
[+++] [10:51:17] [fstalign] 6	7	0/<eps>	0/<eps>	0.013604832
[+++] [10:51:17] [fstalign] 6	7	50/	50/	0.0062978663
[+++] [10:51:17] [fstalign] 6	7	51/	51/	0.0049225087
[+++] [10:51:17] [fstalign] 6	7	52/	52/	0.0039882683
[+++] [10:51:17] [fstalign] 6	7	23/of	23/of	0.0033028126
[+++] [10:51:17] [fstalign] 6	7	53/	53/	0.0029480883
[+++] [10:51:17] [fstalign] 6	7	54/	54/	0.002575561
[+++] [10:51:17] [fstalign] 7	8	55/	55/	0.43735883
[+++] [10:51:17] [fstalign] 7	8	8/hello	8/hello	0.40650827
[+++] [10:51:17] [fstalign] 7	8	56/	56/	0.038571022
[+++] [10:51:17] [fstalign] 7	8	57/	57/	0.010218942
[+++] [10:51:17] [fstalign] 7	8	58/	58/	0.009362684
[+++] [10:51:17] [fstalign] 8	9	59/	59/	0.81295073
[+++] [10:51:17] [fstalign] 8	9	60/	60/	0.05420902
[+++] [10:51:17] [fstalign] 8	9	61/	61/	0.02045335
[+++] [10:51:17] [fstalign] 8	9	24/minutes	24/minutes	0.018062603
[+++] [10:51:17] [fstalign] 8	9	54/	54/	0.012383968
[+++] [10:51:17] [fstalign] 8	9	62/	62/	0.007653652
[+++] [10:51:17] [fstalign] 9	10	10/fine	10/fine	1.0
[+++] [10:51:17] [fstalign] 10	11	11/suzana	11/suzana	0.7825733
[+++] [10:51:17] [fstalign] 10	11	35/	35/	0.10785312
[+++] [10:51:17] [fstalign] 10	11	63/	63/	0.09928422
[+++] [10:51:17] [fstalign] 10	11	13/are	13/are	0.010289361
[+++] [10:51:17] [fstalign] 11	12	12/how	12/how	1.0
[+++] [10:51:17] [fstalign] 12	13	13/are	13/are	1.0
[+++] [10:51:17] [fstalign] 13	14	14/you	14/you	1.0
[+++] [10:51:17] [fstalign] 14	15	15/mhm	15/mhm	1.0
[+++] [10:51:17] [fstalign] 15	16	16/sure	16/sure	1.0
[+++] [10:51:17] [fstalign] 16	17	1/<oov>	1/<oov>	0.9930773
[+++] [10:51:17] [fstalign] 16	17	35/	35/	0.006035227
[+++] [10:51:17] [fstalign] 16	17	64/	64/	0.0008874871
[+++] [10:51:17] [fstalign] 17	18	17/yes	17/yes	0.99649245
[+++] [10:51:17] [fstalign] 17	18	65/	65/	0.003507566
[+++] [10:51:17] [fstalign] 18	19	18/okay	18/okay	1.0
[+++] [10:51:17] [fstalign] 19	20	19/ah	19/ah	1.0
[+++] [10:51:17] [fstalign] 20	21	20/just	20/just	0.8747955
[+++] [10:51:17] [fstalign] 20	21	66/	66/	0.08273692
[+++] [10:51:17] [fstalign] 21	22	13/are	13/are	0.879016
[+++] [10:51:17] [fstalign] 21	22	39/	39/	0.052434582
[+++] [10:51:17] [fstalign] 21	22	66/	66/	0.029785942
[+++] [10:51:17] [fstalign] 21	22	67/	67/	0.0126816565
[+++] [10:51:17] [fstalign] 21	22	68/	68/	0.00951749
[+++] [10:51:17] [fstalign] 21	22	69/	69/	0.007649917
[+++] [10:51:17] [fstalign] 21	22	35/	35/	0.0052877315
[+++] [10:51:17] [fstalign] 21	22	70/	70/	0.0021538541
[+++] [10:51:17] [fstalign] 21	22	71/	71/	0.0014728603
[+++] [10:51:17] [fstalign] 22	23	21/a	21/a	0.9477601
[+++] [10:51:17] [fstalign] 22	23	72/	72/	0.052239873
[+++] [10:51:17] [fstalign] 23	24	22/couple	22/couple	1.0
[+++] [10:51:17] [fstalign] 24	25	10/fine	10/fine	0.9680188
[+++] [10:51:17] [fstalign] 24	25	54/	54/	0.02020042
[+++] [10:51:17] [fstalign] 24	25	56/	56/	0.011780825
[+++] [10:51:17] [fstalign] 25	26	10/fine	10/fine	1.0
[+++] [10:51:17] [walker] starting a walk in the park
[+++] [10:51:17] [walker] we have 0 candidates after 28 loops
[+++] [10:51:17] [fstalign] done walking the graph
terminate called after throwing an instance of 'std::runtime_error'
  what():  no alignment produced
Aborted                 (core dumped)

However, the proper FST is:

0	1	0	0	0.991589
0	1	1	1	0.00841093
1	2	2	2	0.851145
1	2	3	3	0.0513097
1	2	4	4	0.0470724
1	2	5	5	0.0155187
2	3	6	6	0.388639
2	3	7	7	0.190798
2	3	8	8	0.106237
2	3	9	9	0.0467945
3	4	10	10	0.696047
3	4	9	9	0.15711
3	4	7	7	0.0146796
3	4	11	11	0.0118875
4	5	12	12	0.521335
4	5	13	13	0.174813
4	5	14	14	0.14042
4	5	15	15	0.0532995
5	6	16	16	0.616331
5	6	17	17	0.185212
5	6	15	15	0.0561793
5	6	18	18	0.0388907
6	7	19	19	0.340145
6	7	1	1	0.298499
6	7	20	20	0.194045
6	7	21	21	0.0166497
7	8	22	22	0.437359
7	8	23	23	0.406508
7	8	24	24	0.038571
7	8	25	25	0.0102189
8	9	26	26	0.812951
8	9	27	27	0.054209
8	9	28	28	0.0204534
8	9	1	1	0.0180626
9	10	21	21	1
10	11	29	29	0.782573
10	11	11	11	0.107853
10	11	30	30	0.0992842
10	11	31	31	0.0102894
11	12	32	32	1
12	13	31	31	1
13	14	33	33	1
14	15	34	34	1
15	16	35	35	1
16	17	4	4	0.993077
16	17	11	11	0.00603523
16	17	36	36	0.000887487
17	18	37	37	0.996492
17	18	38	38	0.00350757
18	19	39	39	1
19	20	40	40	1
20	21	41	41	0.874795
20	21	42	42	0.0827369
21	22	31	31	0.879016
21	22	43	43	0.0524346
21	22	42	42	0.0297859
21	22	44	44	0.0126817
22	23	45	45	0.94776
22	23	46	46	0.0522399
23	24	47	47	1
24	25	21	21	0.968019
24	25	48	48	0.0202004
24	25	24	24	0.0117808
25	26	21	21	1
26

with a symbol table:

hello	0
i'm	1
fine	2
suzana 3
how	4
are	5
you	6
mhm	7
sure	8
yes	9
okay	10
ah	11
just	12
a	13
couple	14
of	15
minutes	16
on	17
my	18
way	19
into	20
doctor's	21
appointment	22
oh	23
ooh	24
foreign	25
i	26
foreigners	27
foreigner	28
shawna	29
sean	30
shaun	31
sharon	32
showing	33
show	34
or	35
howard	36
it	37
how're	38
our	39
hard	40
hour	41
is	42
here	43
there	44
ya	45
today	46
avenue	47
hum	48
huh	49
hm	50
wow	51
yeah	52
hey	53
right	54
sir	55
sorry	56
share	57
star	58
no	59
nope	60
know	61
most	62
uh	63
um	64
more	65
enjoy	66
enjoyed	67
er	68
your	69
we're	70
her	71
doctors	72

The bug is thus:

  1. The hyp is loaded with hyp size: 0 when it is not 0.
  2. The symbol table is ignored and the symbols in the FST are completely wrong. The first two lines should be:
0	1	23/oh	23/oh	0.99158907
0	1	24/ooh    24/ooh     0.008410932

but were

[+++] [23:45:44] [fstalign] 0	1	23/of	23/of	0.99158907
[+++] [23:45:44] [fstalign] 0	1	24/minutes	24/minutes	0.008410932

i.e. the ids were mistakenly shifted by -8 when mapped to symbols.

  3. The hyp FST contains strange elements after loading that never occur in the original FST. In fact, the loaded FST looks quite different from the original one!
  4. There are arcs that are not in the hyp FST, such as: [+++] [23:45:44] [fstalign] 5 6 6/<silence> 6/<silence> 0.6163309
  5. No alignment is produced.
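
A check along these lines could confirm whether the symbol table is actually being read and whether every label in the hypothesis FST has an entry in it. This is only a sketch: `parse_symbols`, `load_symbols` and `unmapped_labels` are hypothetical helper names, and it assumes one `word id` pair per line in the symbol table and the text FST layout shown above (`src dst ilabel olabel weight`, with a bare state id marking the final state).

```python
from pathlib import Path


def parse_symbols(text):
    """Parse `word<whitespace>id` lines into an {id: word} dict."""
    symbols = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, idx = line.rsplit(None, 1)  # tolerate tabs or spaces
        symbols[int(idx)] = word
    return symbols


def load_symbols(path):
    """Read a symbol table from disk, failing loudly if it is missing."""
    table = Path(path)
    if not table.is_file():
        raise FileNotFoundError(f"symbol table not found: {path}")
    return parse_symbols(table.read_text(encoding="utf-8"))


def unmapped_labels(fst_lines, symbols):
    """Return the sorted arc label ids that have no entry in the table."""
    missing = set()
    for line in fst_lines:
        fields = line.split()
        if len(fields) < 4:  # final-state line such as "26"
            continue
        for label in fields[2:4]:
            if int(label) not in symbols:
                missing.add(int(label))
    return sorted(missing)
```

An empty result from `unmapped_labels` means every input and output label on every arc can be rendered as a word.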
@niedakh niedakh changed the title FST ignores symbols table Case: fstalign ignores symbols table and does not find alignment in a simple transcript Feb 15, 2023
Author

niedakh commented Feb 21, 2023

Some progress:

I started playing with the symbol loading outputs, and it turns out fstalign wasn't loading the file (the file wasn't there, but nothing failed). After fixing that and adding the ASR control symbols, I have:

symbol table:

<eps>	0
<oov>	1
<ins>	2
<del>	3
<sub>	4
<inaudible>	5
<silence>	6
<unk>	7
hello	8
i'm	9
fine	10
suzana	11
how	12
are	13
you	14
mhm	15
sure	16
yes	17
okay	18
ah	19
just	20
a	21
couple	22
of	23
minutes	24
on	25
my	26
way	27
into	28
doctor's	29
appointment	30
oh	31
ooh	32
foreign	33
i	34
foreigners	35
foreigner	36
shawna	37
sean	38
shaun	39
sharon	40
showing	41
show	42
or	43
howard	44
it	45
how're	46
our	47
hard	48
hour	49
is	50
here	51
there	52
ya	53
today	54
avenue	55
hum	56
huh	57
hm	58
wow	59
yeah	60
hey	61
right	62
sir	63
sorry	64
share	65
star	66
no	67
nope	68
know	69
most	70
uh	71
um	72
more	73
enjoy	74
enjoyed	75
er	76
your	77
we're	78
her	79
doctors	80
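
The table above is the original word list with eight control symbols placed in front, so every word id moves up by 8 (hello goes from 0 to 8, doctors from 72 to 80). A minimal sketch of that renumbering, assuming the control-symbol order shown above; `with_control_symbols` and `format_table` are hypothetical helper names:

```python
# Control symbols in the order they appear in the table above.
CONTROL_SYMBOLS = [
    "<eps>", "<oov>", "<ins>", "<del>",
    "<sub>", "<inaudible>", "<silence>", "<unk>",
]


def with_control_symbols(words):
    """Return (symbol, id) pairs with control symbols first, then the words."""
    ordered = CONTROL_SYMBOLS + list(words)
    return [(symbol, idx) for idx, symbol in enumerate(ordered)]


def format_table(pairs):
    """Render the pairs as tab-separated `symbol<TAB>id` lines."""
    return "\n".join(f"{symbol}\t{idx}" for symbol, idx in pairs)
```

Note that the label ids inside the FST file have to be shifted by the same offset, otherwise the arcs point at the wrong words.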

With this, fstalign correctly prints out the FST:

[2023-02-21 10:40:35.807] [console] [info] loggers initialized
[+++] [10:40:35] [console] fstalign version is 1.6.1
[+++] [10:40:35] [console] reading reference plain text from /data/ctm-fst-align/22-08E6AADCCBB0305AFB_customer/ref.txt
[2023-02-21 10:41:09.182] [console] [info] loggers initialized
[+++] [10:41:09] [console] fstalign version is 1.6.1
[+++] [10:41:09] [console] reading reference ctm from /data/ctm-fst-align/22-08E6AADCCBB0305AFB_customer/ref.ctm
[+++] [10:41:09] [console] reading hypothesis fst from /data/ctm-fst-align/22-08E6AADCCBB0305AFB_customer/hyp.fst
[+++] [10:41:09] [fstalign] starting conversion to int vector
[+++] [10:41:09] [fstalign] converting ref to int vector
[+++] [10:41:09] [ctmloader] creating std::vector<int> for CTM for 27 tokens
[+++] [10:41:09] [fstalign] converting hyp to int vector
[+++] [10:41:09] [FstFileLoader] convertToIntVector isn't implemented for FST inputs
[+++] [10:41:09] [fstalign] Either ref or hyp is really small, skipping over the levenstein distance,  ref size: 27, hyp size: 0
[+++] [10:41:09] [FstFileLoader] Total FST has 27 states.
[+++] [10:41:09] [fstalign] generating ref synonyms from symbol table
[+++] [10:41:09] [fstalign] applying ref synonyms on ref fst
[+++] [10:41:09] [SynonymEngine] we have 0 registered first word rules label id
[+++] [10:41:09] [fstalign] printing ref fst
[+++] [10:41:09] [fstalign] 0	1	0/hello	0/hello	0.0
[+++] [10:41:09] [fstalign] 1	2	1/i'm	1/i'm	0.0
[+++] [10:41:09] [fstalign] 2	3	2/fine	2/fine	0.0
[+++] [10:41:09] [fstalign] 3	4	3/suzana	3/suzana	0.0
[+++] [10:41:09] [fstalign] 4	5	4/how	4/how	0.0
[+++] [10:41:09] [fstalign] 5	6	5/are	5/are	0.0
[+++] [10:41:09] [fstalign] 6	7	6/you	6/you	0.0
[+++] [10:41:09] [fstalign] 7	8	7/mhm	7/mhm	0.0
[+++] [10:41:09] [fstalign] 8	9	8/sure	8/sure	0.0
[+++] [10:41:09] [fstalign] 9	10	9/yes	9/yes	0.0
[+++] [10:41:09] [fstalign] 10	11	10/okay	10/okay	0.0
[+++] [10:41:09] [fstalign] 11	12	11/ah	11/ah	0.0
[+++] [10:41:09] [fstalign] 12	13	12/just	12/just	0.0
[+++] [10:41:09] [fstalign] 13	14	13/a	13/a	0.0
[+++] [10:41:09] [fstalign] 14	15	14/couple	14/couple	0.0
[+++] [10:41:09] [fstalign] 15	16	15/of	15/of	0.0
[+++] [10:41:09] [fstalign] 16	17	16/minutes	16/minutes	0.0
[+++] [10:41:09] [fstalign] 17	18	1/i'm	1/i'm	0.0
[+++] [10:41:09] [fstalign] 18	19	17/on	17/on	0.0
[+++] [10:41:09] [fstalign] 19	20	18/my	18/my	0.0
[+++] [10:41:09] [fstalign] 20	21	19/way	19/way	0.0
[+++] [10:41:09] [fstalign] 21	22	20/into	20/into	0.0
[+++] [10:41:09] [fstalign] 22	23	13/a	13/a	0.0
[+++] [10:41:09] [fstalign] 23	24	21/doctor's	21/doctor's	0.0
[+++] [10:41:09] [fstalign] 24	25	22/appointment	22/appointment	0.0
[+++] [10:41:09] [fstalign] 25	26	10/okay	10/okay	0.0
[+++] [10:41:09] [fstalign] 26	27	10/okay	10/okay	0.0
[+++] [10:41:09] [fstalign] printing hyp fst
[+++] [10:41:09] [fstalign] 0	1	23/oh	23/oh	0.99158907
[+++] [10:41:09] [fstalign] 0	1	24/ooh	24/ooh	0.008410932
[+++] [10:41:09] [fstalign] 1	2	25/foreign	25/foreign	0.8511446
[+++] [10:41:09] [fstalign] 1	2	2/fine	2/fine	0.05130972
[+++] [10:41:09] [fstalign] 1	2	1/i'm	1/i'm	0.047072377
[+++] [10:41:09] [fstalign] 1	2	26/i	26/i	0.0155186895
[+++] [10:41:09] [fstalign] 1	2	27/foreigners	27/foreigners	0.005192776
[+++] [10:41:09] [fstalign] 1	2	28/foreigner	28/foreigner	0.0033090003
[+++] [10:41:09] [fstalign] 1	2	23/oh	23/oh	0.0015563301
[+++] [10:41:09] [fstalign] 2	3	29/shawna	29/shawna	0.3886394
[+++] [10:41:09] [fstalign] 2	3	30/sean	30/sean	0.19079825
[+++] [10:41:09] [fstalign] 2	3	31/shaun	31/shaun	0.10623746
[+++] [10:41:09] [fstalign] 2	3	3/suzana	3/suzana	0.04679449
[+++] [10:41:09] [fstalign] 2	3	2/fine	2/fine	0.034456342
[+++] [10:41:09] [fstalign] 2	3	25/foreign	25/foreign	0.0081548225
[+++] [10:41:09] [fstalign] 2	3	32/sharon	32/sharon	0.004374613
[+++] [10:41:09] [fstalign] 2	3	26/i	26/i	0.002221351
[+++] [10:41:09] [fstalign] 2	3	1/i'm	1/i'm	0.002075816
[+++] [10:41:09] [fstalign] 2	3	33/showing	33/showing	0.0019667444
[+++] [10:41:09] [fstalign] 2	3	34/show	34/show	0.0005650157
[+++] [10:41:09] [fstalign] 3	4	4/how	4/how	0.69604653
[+++] [10:41:09] [fstalign] 3	4	3/suzana	3/suzana	0.15710989
[+++] [10:41:09] [fstalign] 3	4	30/sean	30/sean	0.014679594
[+++] [10:41:09] [fstalign] 3	4	35/or	35/or	0.011887497
[+++] [10:41:09] [fstalign] 3	4	36/howard	36/howard	0.0065641715
[+++] [10:41:09] [fstalign] 3	4	2/fine	2/fine	0.0056938045
[+++] [10:41:09] [fstalign] 3	4	25/foreign	25/foreign	0.0021069725
[+++] [10:41:09] [fstalign] 3	4	26/i	26/i	0.001466121
[+++] [10:41:09] [fstalign] 3	4	29/shawna	29/shawna	0.0013138258
[+++] [10:41:09] [fstalign] 3	4	1/i'm	1/i'm	0.0012702389
[+++] [10:41:09] [fstalign] 3	4	17/on	17/on	0.0009687771
[+++] [10:41:09] [fstalign] 4	5	5/are	5/are	0.5213346
[+++] [10:41:09] [fstalign] 4	5	37/it	37/it	0.17481348
[+++] [10:41:09] [fstalign] 4	5	38/how're	38/how're	0.14042015
[+++] [10:41:09] [fstalign] 4	5	36/howard	36/howard	0.053299483
[+++] [10:41:09] [fstalign] 4	5	3/suzana	3/suzana	0.042188246
[+++] [10:41:09] [fstalign] 4	5	39/our	39/our	0.011979131
[+++] [10:41:09] [fstalign] 4	5	2/fine	2/fine	0.0036785
[+++] [10:41:09] [fstalign] 4	5	35/or	35/or	0.002629472
[+++] [10:41:09] [fstalign] 4	5	30/sean	30/sean	0.0022271618
[+++] [10:41:09] [fstalign] 4	5	40/hard	40/hard	0.002156982
[+++] [10:41:09] [fstalign] 4	5	41/hour	41/hour	0.0016741576
[+++] [10:41:09] [fstalign] 5	6	6/you	6/you	0.6163309
[+++] [10:41:09] [fstalign] 5	6	42/is	42/is	0.18521181
[+++] [10:41:09] [fstalign] 5	6	36/howard	36/howard	0.056179322
[+++] [10:41:09] [fstalign] 5	6	43/here	43/here	0.038890716
[+++] [10:41:09] [fstalign] 5	6	44/there	44/there	0.0326784
[+++] [10:41:09] [fstalign] 5	6	3/suzana	3/suzana	0.026641503
[+++] [10:41:09] [fstalign] 5	6	45/ya	45/ya	0.020304155
[+++] [10:41:09] [fstalign] 5	6	46/today	46/today	0.006060596
[+++] [10:41:09] [fstalign] 5	6	5/are	5/are	0.0041220332
[+++] [10:41:09] [fstalign] 5	6	30/sean	30/sean	0.0033203475
[+++] [10:41:09] [fstalign] 5	6	47/avenue	47/avenue	0.0029174143
[+++] [10:41:09] [fstalign] 6	7	48/hum	48/hum	0.34014535
[+++] [10:41:09] [fstalign] 6	7	24/ooh	24/ooh	0.2984986
[+++] [10:41:09] [fstalign] 6	7	49/huh	49/huh	0.19404508
[+++] [10:41:09] [fstalign] 6	7	10/okay	10/okay	0.016649699
[+++] [10:41:09] [fstalign] 6	7	0/hello	0/hello	0.013604832
[+++] [10:41:09] [fstalign] 6	7	50/hm	50/hm	0.0062978663
[+++] [10:41:09] [fstalign] 6	7	51/wow	51/wow	0.0049225087
[+++] [10:41:09] [fstalign] 6	7	52/yeah	52/yeah	0.0039882683
[+++] [10:41:09] [fstalign] 6	7	23/oh	23/oh	0.0033028126
[+++] [10:41:09] [fstalign] 6	7	53/hey	53/hey	0.0029480883
[+++] [10:41:09] [fstalign] 6	7	54/right	54/right	0.002575561
[+++] [10:41:09] [fstalign] 7	8	55/sir	55/sir	0.43735883
[+++] [10:41:09] [fstalign] 7	8	8/sure	8/sure	0.40650827
[+++] [10:41:09] [fstalign] 7	8	56/sorry	56/sorry	0.038571022
[+++] [10:41:09] [fstalign] 7	8	57/share	57/share	0.010218942
[+++] [10:41:09] [fstalign] 7	8	58/star	58/star	0.009362684
[+++] [10:41:09] [fstalign] 8	9	59/no	59/no	0.81295073
[+++] [10:41:09] [fstalign] 8	9	60/nope	60/nope	0.05420902
[+++] [10:41:09] [fstalign] 8	9	61/know	61/know	0.02045335
[+++] [10:41:09] [fstalign] 8	9	24/ooh	24/ooh	0.018062603
[+++] [10:41:09] [fstalign] 8	9	54/right	54/right	0.012383968
[+++] [10:41:09] [fstalign] 8	9	62/most	62/most	0.007653652
[+++] [10:41:09] [fstalign] 9	10	10/okay	10/okay	1.0
[+++] [10:41:09] [fstalign] 10	11	11/ah	11/ah	0.7825733
[+++] [10:41:09] [fstalign] 10	11	35/or	35/or	0.10785312
[+++] [10:41:09] [fstalign] 10	11	63/uh	63/uh	0.09928422
[+++] [10:41:09] [fstalign] 10	11	13/a	13/a	0.010289361
[+++] [10:41:09] [fstalign] 11	12	12/just	12/just	1.0
[+++] [10:41:09] [fstalign] 12	13	13/a	13/a	1.0
[+++] [10:41:09] [fstalign] 13	14	14/couple	14/couple	1.0
[+++] [10:41:09] [fstalign] 14	15	15/of	15/of	1.0
[+++] [10:41:09] [fstalign] 15	16	16/minutes	16/minutes	1.0
[+++] [10:41:09] [fstalign] 16	17	1/i'm	1/i'm	0.9930773
[+++] [10:41:09] [fstalign] 16	17	35/or	35/or	0.006035227
[+++] [10:41:09] [fstalign] 16	17	64/um	64/um	0.0008874871
[+++] [10:41:09] [fstalign] 17	18	17/on	17/on	0.99649245
[+++] [10:41:09] [fstalign] 17	18	65/more	65/more	0.003507566
[+++] [10:41:09] [fstalign] 18	19	18/my	18/my	1.0
[+++] [10:41:09] [fstalign] 19	20	19/way	19/way	1.0
[+++] [10:41:09] [fstalign] 20	21	20/into	20/into	0.8747955
[+++] [10:41:09] [fstalign] 20	21	66/enjoy	66/enjoy	0.08273692
[+++] [10:41:09] [fstalign] 21	22	13/a	13/a	0.879016
[+++] [10:41:09] [fstalign] 21	22	39/our	39/our	0.052434582
[+++] [10:41:09] [fstalign] 21	22	66/enjoy	66/enjoy	0.029785942
[+++] [10:41:09] [fstalign] 21	22	67/enjoyed	67/enjoyed	0.0126816565
[+++] [10:41:09] [fstalign] 21	22	68/er	68/er	0.00951749
[+++] [10:41:09] [fstalign] 21	22	69/your	69/your	0.007649917
[+++] [10:41:09] [fstalign] 21	22	35/or	35/or	0.0052877315
[+++] [10:41:09] [fstalign] 21	22	70/we're	70/we're	0.0021538541
[+++] [10:41:09] [fstalign] 21	22	71/her	71/her	0.0014728603
[+++] [10:41:09] [fstalign] 22	23	21/doctor's	21/doctor's	0.9477601
[+++] [10:41:09] [fstalign] 22	23	72/doctors	72/doctors	0.052239873
[+++] [10:41:09] [fstalign] 23	24	22/appointment	22/appointment	1.0
[+++] [10:41:09] [fstalign] 24	25	10/okay	10/okay	0.9680188
[+++] [10:41:09] [fstalign] 24	25	54/right	54/right	0.02020042
[+++] [10:41:09] [fstalign] 24	25	56/sorry	56/sorry	0.011780825
[+++] [10:41:09] [fstalign] 25	26	10/okay	10/okay	1.0
[+++] [10:41:09] [walker] starting a walk in the park
[+++] [10:41:09] [walker] we have 0 candidates after 27 loops
[+++] [10:41:09] [fstalign] done walking the graph
terminate called after throwing an instance of 'std::runtime_error'
  what():  no alignment produced
Aborted (core dumped)

So the problem persists: even with a correct FST and a transcript that is easy to align by hand, the graph walker fails to align the transcripts.
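
To put a number on "easy to align by hand", one could take the highest-weight arc at each position of the hypothesis and compute a word-level edit distance against the reference. The sketch below is an independent illustration, not fstalign's walker. `best_path` and `edit_distance` are hypothetical helpers, and arcs are assumed to be `(src, dst, word, weight)` tuples forming a linear chain of states, as in the log above.

```python
def best_path(arcs):
    """Pick the highest-weight word leaving each source state, in state order."""
    best = {}
    for src, _dst, word, weight in arcs:
        if src not in best or weight > best[src][1]:
            best[src] = (word, weight)
    return [best[src][0] for src in sorted(best)]


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]
```

A small distance relative to the 27 reference tokens would support the claim that an alignment exists and that the failure lies in the walker rather than in the data.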

Author

niedakh commented Feb 21, 2023

Changing the composition approach to standard fixed the issue, I think.

@nishchalb
Contributor

Hi, we don't yet support FST input for the composition approach that we made the default in https://github.com/revdotcom/fstalign/releases/tag/1.2.0, so you would have to use the standard composition approach.

[+++] [10:51:17] [fstalign] converting hyp to int vector
[+++] [10:51:17] [FstFileLoader] convertToIntVector isn't implemented for FST inputs

We will make this clearer in our docs. Thank you for the detailed issue!
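
For readers landing here, a rerun with the standard composition might be assembled as in the sketch below. The `--composition-approach standard` option is an assumption about the fstalign CLI and should be confirmed with `fstalign wer --help`. The binary path and the other flags are taken from the original report, and `build_wer_command` is a hypothetical helper.

```python
import shlex


def build_wer_command(ref, hyp, symbols, approach="standard"):
    """Assemble an fstalign `wer` invocation as an argument list."""
    return [
        "/fstalign/bin/fstalign", "wer",
        "--ref", ref,
        "--hyp", hyp,
        "--symbols", symbols,
        # Assumed flag name; verify with `fstalign wer --help`.
        "--composition-approach", approach,
    ]


def as_shell(cmd):
    """Quote the argument list so it can be pasted into a shell."""
    return shlex.join(cmd)
```

The argument list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths.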

Author

niedakh commented Feb 25, 2023

Thank you, happy I could perhaps help someone looking for this in the future! Your library is amazing!
