-
Notifications
You must be signed in to change notification settings - Fork 1
/
readgedcom.py
3166 lines (2526 loc) · 117 KB
/
readgedcom.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
"""
Read a GEDCOM file into a data structure, parse into a dict of
individuals and families for simplified handling; converting to
HTML pages, JSON data, etc.
Public functions:
data = read_file( gedcom_file_name [, settings ] )
detect_loops( data, report_to_stderr )
output_original( data, out_file_name )
set_privatize_flag( data )
unset_privatize_flag( data )
output_reordered( data, person_id, out_file_name )
output_privatized( data, out_file_name )
report_individual_double_facts( data )
report_family_double_facts( data )
report_descendant_report( data )
report_counts( data )
id_list = find_individuals( data, search_tag, search_value, operation='=', only_best=True )
list_of_indi = list_intersection( list1, [list2, ...] )
list_of_indi = list_difference( original, subtract1, [subtract2, ,,,] )
list_of_indi = list_combine( list1, [list2, ...] )
print_individuals( data, id_list )
The input file should be well formed as this code only checks for
a few structural errors. A verification program can be found here:
https://chronoplexsoftware.com/gedcomvalidator/
and
http://ftanalyzer.com
Fixable mistakes are corrected as the data is parsed into the data structure.
If the option to output a privatized file is taken, the mistakes from the
original input will also go into the new file.
Some trouble messages go to stderr.
If something really bad is encountered an exception is thrown.
This code handles only the Gregorian calendar with optional epoch setting of BCE.
Specs at https://gedcom.io/specs/
Some notes on limitations, etc.
- Re-marriages should be additional FAM records
https://www.tamurajones.net/MarriedDivorcedMarriedAgain.xhtml
but it does complicate full siblings
https://www.beholdgenealogy.com/blog/?p=1303
- character sets
The input file should be UTF-8, not ANSEL
This code is released under the MIT License: https://opensource.org/licenses/MIT
Copyright (c) 2022 John A. Andrea
v2.0
"""
import sys
import copy
import re
import datetime
from collections.abc import Iterable
from collections import defaultdict
# Sections to be created by the parsing
PARSED_INDI = 'individuals'
PARSED_FAM = 'families'
PARSED_PLACES = 'places'
PARSED_MESSAGES = 'messages'
PARSED_SECTIONS = [PARSED_INDI, PARSED_FAM, PARSED_MESSAGES]
# GEDCOM v7.0 requires this character sequence at the start of the file.
# It may also be present in older versions (RootsMagic does include it).
FILE_LEAD_CHAR = '\ufeff'
# The "x" becomes a "startwsith" comparison
SUPPORTED_VERSIONS = [ '5.5.1', '5.5.5', '7.0.x' ]
# Section types, listed in order or at least header first and trailer last.
# Some are not valid in GEDCOM 5.5.x, but that's ok if they are not found.
# Including a RootsMagic specific: _evdef, _todo
# Including Legacy specific: _plac_defn, _event_defn
# GEDCOM 7 uses "plac" as the place section, handled below
SECT_HEAD = 'head'
SECT_INDI = 'indi'
SECT_FAM = 'fam'
SECT_PLAC = '_plac'
SECT_TRLR = 'trlr'
NON_STD_SECTIONS = ['_evdef', '_todo', '_plac_defn', '_event_defn']
SECTION_NAMES = [SECT_HEAD, 'subm', SECT_INDI, SECT_FAM, SECT_PLAC, 'obje', 'repo', 'snote', 'sour'] + NON_STD_SECTIONS + [SECT_TRLR]
# From GEDCOM 7.0.1 spec pg 40
FAM_EVENT_TAGS = ['anul','cens','div','divf','enga','marb','marc','marl','mars','marr','even']
# From GEDCOM 7.0.1 spec pg 44
INDI_EVENT_TAGS = ['bapm','barm','basm','bles','buri','cens','chra','conf','crem','deat','emig','fact','fcom','grad','immi','natu','ordn','prob','reti','will','adop','birt','chr','even','resi']
# Other individual tags of interest placed into the parsed section,
# in addition to the event tags and of course the name(s)
# including some less common items which are identification items.
OTHER_INDI_TAGS = ['sex', 'exid', 'fams', 'famc', 'refn', '_uid', 'uuid', '_uuid']
# Other family tags of interest placed into the parsed section,
# in addition to the event tags
FAM_MEMBER_TAGS = ['husb', 'wife', 'chil']
OTHER_FAM_TAGS = []
# Events in the life of a person (or family) which can only occur once
# but might have more than one entry because research is inconclusive.
# These are the ones which will occur in the 'best' lists. See below
# for proved/disproves,etc. and example code.
# A person could be buried multiple times, immigrate multiple times, etc.
# and such entries would be a list and if proven/disproved they would need
# to be checked on their own.
INDI_SINGLE_EVENTS = ['name','sex','birt','deat']
# Assuming that a couple married the second time constiutes a second
# family entry. A couple could for instance get engaged more than once.
FAM_SINGLE_EVENTS = ['marr','div','anul']
# Individual records which are only allowed to occur once.
# However they will still be placed into an array to be consistent
# with the other facts/events.
# An exception will be thrown if a duplicate is found and processing will exit.
# Use of a validator is recommended.
# Not 100% sure EXID should be on this list.
# Not 100% sure REFN should not be on this list.
ONCE_INDI_TAGS = ['exid', '_uid', 'uuid', '_uuid']
# Family items allowed only once.
# See the description of individuals only once.
ONCE_FAM_TAGS = ['husb','wife']
# There are other important records, such as birth and death which are allowed
# to occur more than once (research purposes).
# A meta-structure will be added to each individual pointing to the "best" event,
# the first one, or the first proven one, or the first primary one.
BEST_EVENT_KEY = 'best-events'
# Tags for proof and primary in the case of multiple event records.
# These are RootsMagic specific. A future version might try to detect the product
# which exported the GEDCOM file. Though such options might not be elsewhere.
EVENT_PRIMARY_TAG = '_prim'
EVENT_PRIMARY_VALUE = 'y'
EVENT_PROOF_TAG = '_proof'
EVENT_PROOF_DEFAULT = 'other'
EVENT_PROOF_VALUES = {'disproven':0, EVENT_PROOF_DEFAULT:1, 'proven':2}
# Sub parts to not generally display
LEVEL2_SUB_NAMES = ['npfx', 'nick', 'nsfx']
# Name sub-parts in order of display appearance
LEVEL2_NAMES = ['givn', 'surn'] + LEVEL2_SUB_NAMES
# This code doesn't deal with calendars, but need to know what to look for
# in case of words before a date.
CALENDAR_NAMES = [ 'gregorian', 'hebrew', 'julian', 'french_r' ]
# From GEDCOM 7.0.3 spec pg 21
DATE_MODIFIERS = [ 'abt', 'aft', 'bef', 'cal', 'est' ]
# Alternate date modifiers which are not in the spec but might be in use.
# Give their allowed replacement.
# Ancestry may place a period after the abbreviation.
ALT_DATE_MODIFIERS = {'about':'abt', 'after':'aft', 'before':'bef',
'ca':'abt', 'circa':'abt',
'calculated':'cal', 'estimate':'est', 'estimated':'est',
'abt.':'abt', 'aft.':'aft', 'bef.':'bef',
'ca.':'abt', 'cal.':'cal', 'est.':'est' }
# The defacto-standard replacement for an unknown name
UNKNOWN_NAME = '[-?-]' #those are supposted to be en-dashes - will update later
# Names. zero included in the zero'th index location for one-based indexing
MONTH_NAMES = ['zero','jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
# Month name to number. "may" is not included twice.
MONTH_NUMBERS = {'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6,
'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec':12,
'january':1, 'february':2, 'march':3, 'april':4, 'june':6,
'july':7, 'august':8, 'september':9, 'october':10, 'november':11, 'december':12}
# Bad dates can be attempted to be fixed or cause exit
DATE_ERR = 'Malformed date:'
# The message for data file troubles
DATA_ERR = 'GEDCOM error. Use a validator. '
DATA_WARN = 'Warning. Use a validator. '
# What to do with unknown sections
UNK_SECTION_ERR = 'Unknown section: '
UNK_SECTION_WARN = 'Warning. Ignoring unknown section:'
# dd mmm yyyy - same format as gedcom
TODAY = datetime.datetime.now().strftime("%d %b %Y")
# Settings for the privatize flag
PRIVATIZE_FLAG = 'privatized'
PRIVATIZE_OFF = 0
PRIVATIZE_MIN = PRIVATIZE_OFF + 1
PRIVATIZE_MAX = PRIVATIZE_MIN + 1
# Some checking to help prevent typos. Failure will throw an exception and exit processing.
# I don't imagine the checking causes much of a performance hit.
# This is not a passed in as a setting.
SELF_CONSISTENCY_CHECKS = True
SELF_CONSISTENCY_ERR = 'Program code inconsistency:'
# complain or threw exception for years outside these,
# but can be over-ridden with extend-years option
min_valid_year = 1100
max_valid_year = 2100
# The detected version of the input file. Treat as a global.
version = ''
# This is the operational settings. Treat as a global
run_settings = dict()
# A place to save all messages which will be copied into the output data. Treat as a global.
all_messages = []
# This becomes a global into the convert routine
unicode_table = dict()
def list_intersection( *lists ):
""" For use with results of find_individuals.
Return the intersection of all the given lists. """
result = set()
first_loop = True
for l in lists:
if isinstance( l, Iterable ):
if first_loop:
result = set( l )
first_loop = False
else:
result.intersection_update( set(l) )
return list( result )
def list_difference( original, *subtract ):
""" For use with results of find_individuals.
Return the list "original" with other lists removed. """
result = set( original )
for l in subtract:
if isinstance( l, Iterable ):
result.difference_update( set(l) )
return list( result )
def list_combine( *lists ):
""" For use with results of find_individuals.
Return as one list with no duplicates. """
result = set()
for l in lists:
if isinstance( l, list ):
result.update( set(l) )
return list( result )
def setup_unicode_table():
""" Define utf-8 characters to convert to unicode characters.
Favouring (Latin) English and French names.
Including backslash and quotes to prevent trouble in output as quoted strings, etc.
"""
# https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
# https://www.compart.com/en/unicode/block
# Other related conversions for ANSEL
# https://www.tamurajones.net/ANSELUnicodeConversion.xhtml
# https://www.tamurajones.net/GEDCOMANSELToUnicode.xhtml
lookup_table = dict()
lookup_table['back slash'] = [ '\\', '\\u005c' ]
lookup_table['back quote'] = [ '`', '\\u0060' ]
lookup_table['double quote'] = [ '"', '\\u0022' ]
lookup_table['single quote'] = [ "'", '\\u0027' ]
lookup_table['en dash'] = [ '\xe2\x80\x93', '\\u2013' ]
lookup_table['em dash'] = [ '\xe2\x80\x94', '\\u2014' ]
lookup_table['A grave'] = [ '\xc0', '\\u00c0' ]
lookup_table['a grave'] = [ '\xe0', '\\u00e0' ]
lookup_table['A acute'] = [ '\xc1', '\\u00c1' ]
lookup_table['a acute'] = [ '\xe1', '\\u00e1' ]
lookup_table['A circumflex'] = [ '\xc2', '\\u00c2' ]
lookup_table['a circumflex'] = [ '\xe2', '\\u00e2' ]
lookup_table['C cedilia'] = [ '\xc7', '\\u00c7' ]
lookup_table['c cedilia'] = [ '\xe7', '\\u00e7' ]
lookup_table['E acute'] = [ '\xc9', '\\u00c9' ]
lookup_table['e acute'] = [ '\xe9', '\\u00e9' ]
lookup_table['E grave'] = [ '\xc8', '\\u00c8' ]
lookup_table['e grave'] = [ '\xe8', '\\u00e8' ]
lookup_table['I grave'] = [ '\xcc', '\\u00cc' ]
lookup_table['i grave'] = [ '\xec', '\\u00ec' ]
lookup_table['I acute'] = [ '\xcd', '\\u00cd' ]
lookup_table['i acute'] = [ '\xed', '\\u00ed' ]
lookup_table['I circumflex'] = [ '\xce', '\\u00ce' ]
lookup_table['i circumflex'] = [ '\xee', '\\u00ee' ]
lookup_table['O grave'] = [ '\xd2', '\\u00d2' ]
lookup_table['o grave'] = [ '\xf2', '\\u00f2' ]
lookup_table['O acute'] = [ '\xd3', '\\u00d3' ]
lookup_table['o acute'] = [ '\xf3', '\\u00f3' ]
lookup_table['O circumflex'] = [ '\xd4', '\\u00d4' ]
lookup_table['o circumflex'] = [ '\xf4', '\\u00f4' ]
lookup_table['O diaresis'] = [ '\xd6', '\\u00d6' ]
lookup_table['o diaresis'] = [ '\xf6', '\\u00f6' ]
lookup_table['U grave'] = [ '\xd9', '\\u00d9' ]
lookup_table['u grave'] = [ '\xf9', '\\u00f9' ]
lookup_table['U acute'] = [ '\xda', '\\u00da' ]
lookup_table['u acute'] = [ '\xfa', '\\u00da' ]
lookup_table['U circumflex'] = [ '\xdb', '\\u00db' ]
lookup_table['u circumflex'] = [ '\xfb', '\\u00fb' ]
lookup_table['U diaresis'] = [ '\xdc', '\\u00dc' ]
lookup_table['u diaresis'] = [ '\xfc', '\\u00fc' ]
lookup_table['Sharp'] = [ '\xdf', '\\u00df' ]
return lookup_table
def convert_to_unicode( s ):
""" Convert common utf-8 encoded characters to unicode for the various display of names etc.
The pythonic conversion routines don't seem to do the job.
"""
text = s.strip()
for item in unicode_table:
text = text.replace( unicode_table[item][0], unicode_table[item][1] )
return text
def convert_to_html( s ):
""" Convert common utf-8 encoded characters to html for the various display of names etc."""
# https://dev.w3.org/html5/html-author/charref
text = s.strip()
text = text.replace('&','&smp;').replace('<','<').replace('>','>' )
text = text.replace('"','"').replace("'",''')
text = text.replace('`','`').replace('\\','\')
# encode generates a byte array, decode goes back to a string
text = text.encode( 'ascii', 'xmlcharrefreplace' ).decode( 'ascii' )
return text
def print_warn( message ):
global all_messages
all_messages.append( message )
if run_settings['display-gedcom-warnings']:
print( message, file=sys.stderr )
def concat_things( *args ):
""" Behave kinda like a print statement: convert all the things to strings.
Return the large concatinated string. """
result = ''
space = ''
for arg in args:
result += space + str(arg).strip()
space = ' '
return result
def setup_settings( settings=None ):
""" Set the settings which control how the program operates.
Return a dict with the defaults or the user supplied values. """
new_settings = dict()
if settings is None:
settings = dict()
if not isinstance( settings, dict ):
settings = dict()
defaults = dict()
defaults['show-settings'] = False
defaults['display-gedcom-warnings'] = False
defaults['exit-on-bad-date'] = False
defaults['exit-on-unknown-section'] = False
defaults['exit-on-no-individuals'] = True
defaults['exit-on-no-families'] = False
defaults['exit-on-missing-individuals'] = False
defaults['exit-on-missing-families'] = False
defaults['exit-if-loop'] = False
defaults['only-birth'] = False
defaults['extend-years'] = False
for item in defaults:
setting = defaults[item]
if item in settings:
# careful if a default is set to "None"
if isinstance( settings[item], type(setting) ):
setting = settings[item]
else:
print( 'Ignoring invalid setting for', item, 'expecting', type(setting), file=sys.stderr )
new_settings[item] = setting
# report any typos
mistakes = []
for item in settings:
if item not in defaults:
mistakes.append( item )
if mistakes:
print( 'Invalid setting(s):', mistakes, file=sys.stderr )
print( 'Expecting one of:', file=sys.stderr )
for item in defaults:
print( item, 'default', defaults[item], file=sys.stderr )
if new_settings['show-settings']:
for item in new_settings:
print( 'Setting', item, '=', new_settings[item], file=sys.stderr )
return new_settings
def string_like_int( s ):
""" Given a string, return true if it contains only digits. """
if re.search( r'\D', s ):
return False
return True
def yyyymmdd_to_date( yyyymmdd ):
#01234567
""" Return the human form of dd mmm yyyy. """
y = yyyymmdd[:4]
m = yyyymmdd[4:6]
d = yyyymmdd[6:]
return d + ' ' + MONTH_NAMES[int(m)] + ' ' + y
def comparable_before_today( years_ago ):
""" Given a number of years before now, return yyyymmdd as that date."""
# The leap year approximation is ok, this isn't for exact comparisons.
leap_days = years_ago % 4
old_date = datetime.datetime.now() - datetime.timedelta( days = (365 * years_ago) + leap_days )
return '%4d%02d%02d' % ( old_date.year, old_date.month, old_date.day )
def strip_lead_chars( line ):
""" Remove the file start characters from the file's first line."""
return line.replace( FILE_LEAD_CHAR, '' )
def month_name_to_number( month_name ):
""" Using the dict of month names, return the int month number, else zero if not found."""
if month_name and month_name.lower() in MONTH_NUMBERS:
return MONTH_NUMBERS[month_name.lower()]
return 0
def detect_loops( data, print_report=True ):
""" Check that every individual cannot be their own sporse,
sibling, or ancestor.
Return 'true' if such a loop is detected, and print to output
for any such conditions if output is selected.
Note that birth and adoption relationships are treated the same.
"""
result = False
assert isinstance( data, dict ), 'Non-dict passed as the data parameter.'
assert PARSED_INDI in data, 'Passed data appears to not be from read_file'
i_key = PARSED_INDI
f_key = PARSED_FAM
def get_info( indi ):
info = get_indi_display( data[i_key][indi] )
out = str(data[i_key][indi]['xref']) + '/ '
out += info['name']
out += '(' + info['birt'] + '-' + info['deat'] + ')'
return out
def show_fam( indi, fam, message ):
if print_report:
fam_info = data[f_key][fam]['xref']
print( get_info(indi), message + ' in family', fam_info, file=sys.stderr )
def show_path( path ):
if print_report:
print( 'People involved in a loop:', file=sys.stderr )
for indi in path:
print( ' ', get_info(indi), file=sys.stderr )
def check_partners():
result = False
tag = 'fams'
for indi in data[i_key]:
if tag in data[i_key][indi]:
for fam in data[i_key][indi][tag]:
partners = []
for partner in ['wife','husb']:
if partner in data[f_key][fam]:
partners.append( data[f_key][fam][partner][0] )
if len( partners ) == 2 and partners[0] == partners[1]:
result = True
show_fam( indi, fam, 'Double partners' )
return result
def check_siblings():
result = False
tag = 'famc'
for indi in data[i_key]:
if tag in data[i_key][indi]:
for fam in data[i_key][indi][tag]:
count = 0
for child in data[f_key][fam]['chil']:
if child == indi:
count += 1
if count > 1:
result = True
show_fam( indi, fam, 'Double child' )
return result
def check_self_ancestor( start_indi, fam, path, all_loopers ):
result = False
for partner_type in ['wife','husb']:
if partner_type in data[f_key][fam]:
partner = data[f_key][fam][partner_type][0]
# skip if already confirmed
if partner not in all_loopers:
if partner in path:
# have we come back to the beginning
if partner == start_indi:
# and don't try to look back to more ancestors
result = True
show_path( path )
all_loopers.extend( path )
else:
if 'famc' in data[i_key][partner]:
for parent_fam in data[i_key][partner]['famc']:
if check_self_ancestor( start_indi, parent_fam, path + [partner], all_loopers ):
result = True
return result
def check_ancestors():
result = False
people_in_a_loop = []
for indi in data[i_key]:
if indi not in people_in_a_loop:
if 'famc' in data[i_key][indi]:
for fam in data[i_key][indi]['famc']:
if check_self_ancestor( indi, fam, [indi], people_in_a_loop ):
result = True
return result
# don't stop if a true condition is met, check everywhere
if check_partners():
result = True
if check_siblings():
result = True
if check_ancestors():
result = True
return result
def add_file_back_ref( file_tag, file_index, parsed_section ):
""" Map back from the parsed section to the correcponding record in the
data read from directly from the input file."""
parsed_section['file_record'] = { 'key':file_tag, 'index':file_index }
def copy_section( from_sect, to_sect, data ):
""" Copy a portion of the data from one section to another."""
if from_sect in SECTION_NAMES:
if from_sect in data:
data[to_sect] = copy.deepcopy( data[from_sect] )
else:
data[to_sect] = []
else:
print_warn( concat_things( 'Cant copy unknown section:', from_sect ) )
def extract_indi_id( tag ):
""" Use the id as the xref which the spec. defines as "@" + xref + "@".
Rmove the @ and change to lowercase leaving the "i"
Ex. from "@i123@" get "i123"."""
return tag.replace( '@', '' ).lower().replace( ' ', '' )
def extract_fam_id( tag ):
""" Sumilar to extract_indi_id. """
return tag.replace( '@', '' ).lower().replace( ' ', '' )
def output_sub_section( level, outf ):
""" Print a portion of the data to the output file handle."""
print( level['in'], file=outf )
for sub_level in level['sub']:
output_sub_section( sub_level, outf )
def output_section( section, outf ):
""" Output a portion of the data to the given file handle. """
for level in section:
output_sub_section( level, outf )
def output_original( data, file ):
"""
Output the original data (unmodified) to the given file handle.
Essentially copying the input gedcom file.
Parameters:
data: data structure retured from the function read_file.
file: name of the output file.
"""
assert isinstance( data, dict ), 'Non-dict passed as the data parameter.'
assert isinstance( file, str ), 'Non-string passed as the filename parameter.'
assert PARSED_INDI in data, 'Passed data appears to not be from read_file'
global version
with open( file, 'w', encoding='utf-8' ) as outf:
if not version.startswith( '5' ):
print( FILE_LEAD_CHAR, end='' )
for sect in SECTION_NAMES:
if sect in data and sect != SECT_TRLR:
output_section( data[sect], outf )
# unknown sections
for sect in data:
if sect not in SECTION_NAMES + PARSED_SECTIONS:
output_section( data[sect], outf )
# finally the trailer
output_section( data[SECT_TRLR], outf )
def output_reordered( data, person_to_reorder, file ):
"""
Output the original data to the given file handle with selected person
moved to the front of the file to become the root person.
Parameters:
data: data structure retured from the function read_file.
person_to_reorder: individual id of person to move.
file: name of the output file.
"""
assert isinstance( data, dict ), 'Non-dict passed as the data parameter.'
assert isinstance( file, str ), 'Non-string passed as the filename parameter.'
assert PARSED_INDI in data, 'Passed data appears to not be from read_file'
global version
def output_individual_sub( person_section ):
if 'sub' in person_section:
for sub_section in person_section['sub']:
print( sub_section['in'], file=outf )
output_individual_sub( sub_section )
def output_individual( person_data ):
print( person_data['in'], file=outf )
output_individual_sub( person_data )
# if not found, complain but continue
do_reorder = True
file_index = None
if person_to_reorder is None:
do_reorder = False
print( 'Selected individual to reorder is not found', file=sys.stderr )
else:
if person_to_reorder in data[PARSED_INDI]:
# maybe the person is already at the front
file_index = data[PARSED_INDI][person_to_reorder]['file_record']['index']
if file_index == 0:
do_reorder = False
else:
do_reorder = False
print( 'Selected individual to reorder is not found', file=sys.stderr )
with open( file, 'w', encoding='utf-8' ) as outf:
if not version.startswith( '5' ):
print( FILE_LEAD_CHAR, end='' )
for sect in SECTION_NAMES:
if sect in data and sect != SECT_TRLR:
if sect == SECT_INDI and do_reorder:
output_individual( data[SECT_INDI][file_index] )
for indi, indi_section in enumerate( data[SECT_INDI] ):
if indi != file_index:
output_individual( indi_section )
else:
output_section( data[sect], outf )
# the ones which have been known
for sect in data:
if sect not in SECTION_NAMES + PARSED_SECTIONS:
output_section( data[sect], outf )
# finally the trailer
output_section( data[SECT_TRLR], outf )
def get_parsed_year( data ):
""" Return only the year portion from the given data section, or an empty string.
The "data" should be the part of the parsed section down to the "date" index."""
value = ''
if data['is_known']:
modifier = data['min']['modifier'].upper()
if modifier:
value = modifier + ' '
value += str( data['min']['year'] )
if data['is_range']:
value += ' '
modifier = data['max']['modifier'].upper()
if modifier:
value += modifier + ' '
value += str( data['max']['year'] )
return value
def get_reduced_date( lookup, parsed_data ):
""" Return the date for the given data section, or an empty string.
The event to lookfor is in lookup['key'] and lookup['index']
where the index is the i'th instance of the event named by the key. """
# It must exist in the parsed data if the event existed in the input file
# except that it might exist in an empty state with sub-records
value = ''
k = lookup['key']
i = lookup['index']
if 'date' in parsed_data[k][i]:
value = get_parsed_year( parsed_data[k][i]['date'] )
return value
def output_section_no_dates( section, outf ):
""" Print a section of the data to the file handle, skipping any date sub-sections."""
for level in section:
if level['tag'] != 'date':
print( level['in'], file=outf )
output_section_no_dates( level['sub'], outf )
def output_privatized_section( level0, priv_setting, event_list, parsed_data, outf ):
""" Print data to the given file handle with the data reduced based on the privatize setting.
'level0' is the un-parsed section correcponding to the
'parsed_data' section for an individual or family.
'event_list' contains the names of events which are likely to contain dates."""
print( level0['in'], file=outf )
for level1 in level0['sub']:
parts = level1['in'].split( ' ', 2 )
tag1 = parts[1].lower()
if tag1 in event_list:
if priv_setting == PRIVATIZE_MAX:
if tag1 == 'even':
# This custom event is output differently than the regular events
# such as birt, deat, etc.
print( level1['in'], file=outf )
# continue, but no dates
output_section_no_dates( level1['sub'], outf )
else:
# For full privatization this event and subsection is skipped
# except it must be shown that the event is flagged as existing
print( parts[0], parts[1], 'Y', file=outf )
else:
# otherwise, partial privatization, reduce the detail in the dates
print( level1['in'], file=outf )
for level2 in level1['sub']:
parts = level2['in'].split( ' ', 2 )
tag2 = parts[1].lower()
if tag2 == 'date':
# use the partly hidden date
print( parts[0], parts[1], get_reduced_date( level1['parsed'], parsed_data), file=outf )
else:
print( level2['in'], file=outf )
# continue with the rest
output_section( level2['sub'], outf )
else:
# Not an event. A date in here is accidental information
output_sub_section( level1, outf )
def output_privatized_indi( level0, priv_setting, data_section, outf ):
""" Print an individual to the output handle, in privatized format."""
output_privatized_section( level0, priv_setting, INDI_EVENT_TAGS, data_section, outf )
def output_privatized_fam( level0, priv_setting, data_section, outf ):
""" Print a family to the output handle, in privatized format."""
output_privatized_section( level0, priv_setting, FAM_EVENT_TAGS, data_section, outf )
def check_section_priv( item, data ):
""" Return the value of the privatization flag for the given individual or family."""
return data[item][PRIVATIZE_FLAG]
def check_fam_priv( fam, data ):
""" Return the value of the privatization flag for the given family. """
return check_section_priv( extract_fam_id( fam ), data[PARSED_FAM] )
def check_indi_priv( indi, data ):
""" Return the value of the privatization flag for the given individual. """
return check_section_priv( extract_indi_id( indi ), data[PARSED_INDI] )
def output_privatized( data, file ):
""""
Print the data to the given file name. Some data will not be output.
Parameters:
data: the data structure returned from the function read_file.
file: name of the file to contain the output.
See the function set_privatize_flag for the settings.
set_privatize_flag is optional, but should be called if this output function is used.
"""
assert isinstance( data, dict ), 'Non-dict passed as data parameter'
assert isinstance( file, str ), 'Non-string passed as the filename parameter'
assert PARSED_INDI in data, 'Passed data appears to not be from read_file'
# Working with the original input lines
# some will be dropped and some dates will be modified
# based on the privatize setting for each person and family.
isect = PARSED_INDI
fsect = PARSED_FAM
with open( file, 'w', encoding='utf-8' ) as outf:
if not version.startswith( '5' ):
print( FILE_LEAD_CHAR, end='' )
for sect in SECTION_NAMES:
if sect in data and sect != SECT_TRLR:
if sect == SECT_INDI:
for section in data[sect]:
indi = extract_indi_id( section['tag'] )
priv_setting = check_indi_priv( indi, data )
if priv_setting == PRIVATIZE_OFF:
output_sub_section( section, outf )
else:
output_privatized_indi( section, priv_setting, data[isect][indi], outf )
elif sect == SECT_FAM:
for section in data[sect]:
fam = extract_fam_id( section['tag'] )
priv_setting = check_fam_priv( fam, data )
if priv_setting == PRIVATIZE_OFF:
output_sub_section( section, outf )
else:
output_privatized_fam( section, priv_setting, data[fsect][fam], outf )
else:
output_section( data[sect], outf )
# unknown sections
for sect in data:
if sect not in SECTION_NAMES + PARSED_SECTIONS:
output_section( data[sect], outf )
# finally the trailer
output_section( data[SECT_TRLR], outf )
def confirm_gedcom_version( data ):
""" Return the GEDCOM version number as detected in the input file.
Raise ValueError exception if no version or unsupported version."""
# This should be called as soon as a non-header section is found
# to ensure the remainder of the file can be handled.
version = None
sect = SECTION_NAMES[0] # head
if data[sect]:
for level1 in data[sect][0]['sub']:
if level1['tag'] == 'gedc':
for level2 in level1['sub']:
if level2['tag'] == 'vers':
version = level2['value']
break
if version:
ok = False
for supported in SUPPORTED_VERSIONS:
if 'x' in supported:
# ex: change "7.0.x" to "7.0." and see if that matches "7.0.3"
with_wildcard = re.sub( r'x.*', '', supported )
if version.startswith( with_wildcard ):
ok = True
break
else:
if version == supported:
ok = True
break
if not ok:
raise ValueError( 'Version not supported:' + str(version) )
else:
raise ValueError( 'Version not detected in header section' )
else:
raise ValueError( 'Header section not detected' )
return version
def line_values( input_line ):
"""
For the given line from the GEDCOM file return a dict of
{
in: exact input line from the file,
tag: the second item on the input line lowercased,
value: input line after tag (or None),
sub: empty array to be used for sub elements
}.
"""
# example:
# 1 CHAR IBM WINDOWS
# becomes
# { in:'1 CHAR IBM WINDOWS', tag:'char', value:'IBM WINDOWS', sub:[] }
#
# example:
# 0 @I32@ INDI
# becomes
# { in:'0 @I32@ INDI', tag:'@i32@', value:'INDI', sub:[] }
#
# example:
# 2 DATE 14 DEC 1895
# becomes
# { in:'2 DATE 14 DEC 1895', tag:'date', value:'14 DEC 1895', sub:[] }
data = dict()
parts = input_line.split(' ', 2)
data['in'] = input_line
data['tag'] = parts[1].lower()
value = None
if len(parts) > 2:
value = parts[2]
data['value'] = value
data['sub'] = []
return data
def date_to_comparable( original ):
"""
Convert a date to a string of format 'yyyymmdd' for comparison with other dates.
Returns a dict:
( 'value':'yyyymmdd', 'malformed':boolean, 'form':useful-format-specifier )
where "malformed" is True if the original had to be repaired to be usable,
where "form" is 'yyyy' or 'yyyymm' or 'yyyymmdd' or '' depending on how much
of the date is useful even though a whole date-like yyyymmdd is returned
The prefix may contain 'gregorian',
the suffix may contain 'bce',
otherwise the date should be well formed, i.e. valid digits and valid month name.
See the GEDCOM spec.
A malformed portion may be converted to a "01", or might throw an exception
if the crash_on_bad_date flag is True (default False).
ValueError is thrown for a non-gregorian calendar.
"""
# examples:
# '7 nov 1996' returns '19961107' and form 'yyyymmdd'
# 'nov 1996' returns '19961101' and form 'yyyymm'
# '1996' returns '19960101' and form 'yyyy'
# '' returns '' and form ''
# 'seven nov 1996' returns '19961101' and form 'yyyymm' or throws ValueError
# '7 never 1996' returns '19960107' and form 'yyyy' or throws ValueError
# '7 nov ninesix' returns '00011107' and form '' and malformed=True or throws ValueError
default_day = 1
default_month = 1
default_year = 1
exit_bad_date = run_settings['exit-on-bad-date']