|
Post by Uncle Buddy on May 15, 2022 6:44:14 GMT -8
Starting over frequently has been my salvation as a neophyte code-writing maniac. Oftentimes, starting over a lot sooner would have been a good idea. And in my case anyway, it isn't often that I should base my new version on my old version. The best parts could be brought into the new version, but only if they were really good. The best ideas come after totally messing up a few times, because then I know from experience what to not do and why. With the dumb ideas pushed aside by hard-won wisdom, the simple and right way to do something is often found hiding in plain sight. Before groping in the dark a few times, sometimes there is no such thing as plain sight.
The first time I started reading GEDCOM, I instinctively isolated each record (each group of lines starting with a zero) into a big dictionary. The lines were split into little lists, one list per line. The whole dictionary was read and its data input to the database after reading the whole GEDCOM file. The scariest part of the code was isolating the lines into separate records, if I recall correctly. I had to use indexes which would start where I left off instead of starting at zero each time, so it got confusing.
I got the impression that this was a waste of resources, so I started over. Why not read each line and input its data on the fly? This was easier to fool around with, and I discovered an important trick, an easy way to get rid of the indexing. The problem was with referring among lines, since it's just a text file and the lines can't read each other. Indexing the lines was the hard way. The easy way was to assign a variable, such as source_id, so that while reading the citation in the next line, the now-needed source_id could be used.
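In code, the trick looks something like this (a minimal sketch of my own; the sample lines and the insert_citation stand-in are illustrations, not Treebard code):

```python
# Sketch of the "remember a variable instead of indexing lines" trick.
# insert_citation() is a stand-in for the real INSERT query.
def insert_citation(source_id, page_text):
    return (source_id, page_text)

lines = [
    "2 SOUR @S2@",
    "3 PAGE 10 September 1962, page 2",
]

source_id = None
citations = []
for line in lines:
    parts = line.split(" ", 2)   # [level, tag, rest of the line]
    tag = parts[1]
    if tag == "SOUR":
        source_id = parts[2]     # remember the xref for later lines
    elif tag == "PAGE":
        # the remembered source_id is ready to use as a foreign key
        citations.append(insert_citation(source_id, parts[2]))

print(citations)  # [('@S2@', '10 September 1962, page 2')]
```

No line indexing needed; the variable does the remembering.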
But I found that splitting each line into a small list was still an essential part of the process, so nothing much had been saved by inputting data to the database on the fly and not saving the little lists. In fact this might have been slower, since often an insert query could only create part of a needed record in the database, and would have to be followed up with an update query. By creating a large dictionary for all the values, that sort of thing should be avoidable.
And I'd cheated by hard-coding sections of code like this:
if line.startswith("0"): ... elif line.startswith("1"): ... elif line.startswith("2"): ...
This would work just fine if there were only five possible levels, but GEDCOM allows the initial number at the beginning of each line to go up to 99. I'm pretty sure that the numbers are meant to be used relative to each other, so the hard-coded stuff would just lead down a blind alley. That's enough to necessitate the beginning of a new rewrite. This time I will try to incorporate the good stuff mentioned above while discontinuing the bad stuff.
For one thing, smaller collections would be nice, instead of one dict for the whole GEDCOM. This would enable me to write less deeply nested for loops, which is always a great idea. Instead of one dict, there could be a dict for INDI records, a dict for SOUR records, etc.
After the amount of time this has taken away from banging my head against the families GUI table in Treebard GPS, let me assure you that one of GEDCOM's scary advantages over us is that fiddling with it is fun. What I'm supposed to do next (if this was a conspiracy theory) is to never get back to Treebard, and spend the rest of my life trying to become a GEDCOM savant.
And by the time I became a GEDCOM savant, I'd have so much of my life invested in it that I'd never listen to the voice of reason again; GEDCOM would naturally be the only way to get the job done.
|
|
|
Post by Uncle Buddy on May 15, 2022 7:01:53 GMT -8
A wise old owl lived in an oak. The more he saw the less he spoke. The less he spoke the more he heard. Why can’t we all be like that wise old bird?
Yeah right. I'll tell you why. Because that might be smart, but it would be inhuman; downright fowl.
And now for something completely different. This is just an untested brainstorm, and I'm posting it as is.
About the kind of transformation that the lines of GEDCOM text have to undergo: what kind of collection should the lines of text become? A nested dict won't work, because if there are, for example, three RESI tags subordinate to the same INDI tag, then each one you add will overwrite the value of the last one, since there can only be one RESI key at the same level of the dict. [Actually that arrangement could be made to work by making it more complex.]
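The collision is easy to demonstrate, along with the more-complex arrangement that does work (a toy example with made-up values):

```python
# The collision: a plain nested dict keeps only the last RESI.
indi = {}
indi["RESI"] = {"DATE": "1962", "PLAC": "Meridian"}
indi["RESI"] = {"DATE": "1962", "PLAC": "Gulfport"}
print(indi["RESI"])  # only Gulfport survives

# One workable fix: a single RESI key whose value is a list of dicts.
indi = {"RESI": []}
indi["RESI"].append({"DATE": "1962", "PLAC": "Meridian"})
indi["RESI"].append({"DATE": "1962", "PLAC": "Gulfport"})
print(len(indi["RESI"]))  # 2 -- both residences kept
```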
The right solution is not to convert the overly clever GEDCOM structure into just another clever, rigid structure. It needs to be an easily useful storage cabinet for the data. The organization within the collection(s) should reflect the way the data is being used, not some clever idea or the shortest code, etc. Don't store the lines; they'd just have to be read again.
But instead of inserting the data to the database on the fly, insert data to the dict on the fly, but the dict has to be structured to match treebard exactly so the final transfer step is effortless...
self.persons = {
    '@I4@': {
        'NAME': [{'birth_name': 'David /Flood/', 'pseudonym': []}],
        'event': [
            {'residence': [
                {'date': '', 'place': '',
                 'citation': [{'text': '', 'source': None},
                              {'text': '', 'source': None}]},
                {'date': '', 'place': '',
                 'citation': [{'text': '', 'source': None},
                              {'text': '', 'source': None}]},
                {'date': '', 'place': '',
                 'citation': [{'text': '', 'source': None},
                              {'text': '', 'source': None}]},
            ]},
        ],
    },
}
...Since the database uses types as keys, the dict should do the same thing. Also, don't fool around with lines at all if possible; loop thru each record, get all RESI and put them where they go. To the greatest possible degree, the structure of the master collection should reflect the way SQL works, eg instead of this...
0 @I4@ INDI
1 NAME David /Flood/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Flood obituary
1 SEX M
1 RESI
2 DATE 1962
2 PLAC Meridian, Lauderdale County, Mississippi
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Flood obituary
1 RESI
2 DATE 1962
2 PLAC Gulfport, Harrison County, Mississippi
2 SOUR @S2@
3 PAGE 26 June 1962, page 2, obituary, "Mrs. Lora Grace Jones"
... you have a structure that literally follows the db-record-with-foreign-keys structure, i.e. one db record is shown as one line, or two at the most. Or maybe what I'm shooting for is for each zero line to become a primary key record in a table, and each one line to become a foreign key record in a table:
0 @I4@ INDI
1 NAME David /Flood/ (SOUR @S2@ [PAGE 10 September 1962, page 2, Samuel F. Flood obituary])
1 SEX M
1 RESI (DATE 1962) (PLAC Meridian, Lauderdale County, Mississippi) (@S2@) [PAGE 10 September 1962, page 2, Samuel F. Flood obituary]
1 RESI (DATE 1962) (PLAC Gulfport, Harrison County, Mississippi) (SOUR @S2@) [PAGE 26 June 1962, page 2, obituary, "Mrs. Lora Grace Jones"]
If I'm not careful I'll fall into the trap of replacing GEDCOM with better GEDCOM (another text-based structure). Also the nested structure doesn't always do the trick, for example with the name, source and citation above, I described that problem in an earlier post today.
My main point above is that too much emphasis has apparently gone into making GEDCOM pretty. Human readable. If associated lines were literally all on the same line, it would save a big headache. The computer doesn't care how long a line of text is, as far as I know. I might be wrong about that. But there is a new idea blossoming, inspired by this line of thinking.
Instead of reading the lines and changing them into sub-lists of some structure that reflects the target data structure, do this: concatenate the subordinate lines. Keep the text as text, don't change anything, just concatenate lines 1 + 2 + 3 + etc. When there are only 1 and 0 lines left, read the lines and put each line into the database on the fly, as I was hoping to do.
In this arrangement, each line is a complete sub-record, not a fragment of a sub-record. To signal a record subordinate to the zero level, use brackets of any kind. If the next sub-record uses the same kind of brackets, it's a sibling. If it uses a different kind of bracket, it's a child. If you end up wanting to split the lines into lists, you can split them at the brackets. Splitting them on )( creates one kind of list. Splitting them on ]( creates another, and splitting them on )[ creates another.
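As a toy example of the splitting idea (the sample string and bracket conventions just illustrate the scheme above; they're not a finished format):

```python
# Splitting a bracket-concatenated sub-record back into a list.
# Same bracket kind = sibling, so ")(" marks a sibling boundary.
subs = "(DATE 1962)(PLAC Gulfport, Harrison County, Mississippi)"

parts = subs.strip("()").split(")(")
print(parts)
# ['DATE 1962', 'PLAC Gulfport, Harrison County, Mississippi']
```

A real version would need some escaping rule, since a note's text could itself contain brackets.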
Start off with an empty string and append the first zero line to it. If the next line is a one line, give it a new line in the same record. If the next line is not a one, concatenate it to the one line. No collections will have to be made, except with something weird like a so-called bi-directional pointer where two isolated parts of the file say the same thing and you have to make sure they agree with each other. There's a whole thread on the redundant FAMS and FAMC tags.
I have no idea whether any of that is practical. Since it would be writing lines to a file one at a time and appending the lines, I'm afraid that it would be faster to split the lines, put them into a list or a bunch of lists or dicts. However, the general inspiration might be worth pursuing. If you have to read the GEDCOM file line by line, it would be better to do that only once. Nesting lists and dicts is OK if you get it right and the resulting structure matches the database structure in some useful way. The thing to watch is, if you're gonna nest, then mindlessly nesting exactly what the GEDCOM line numbers suggest would be a mistake. The nesting has to reflect 1) what is a primary key record in the database, and 2) what is a foreign key record in the database.
|
|
|
Post by Uncle Buddy on May 15, 2022 8:05:25 GMT -8
Here are 782 lines of GEDCOM transformed into 64 lines--one line for each primary key in the database--in a split second. This isn't exactly what I was looking for yet, but with the line numbers replaced by separators that denote nesting levels, this might be the start of something good. I don't know why the TRLR line didn't write, there should be 65 lines.
0 HEAD 1 SOUR FAMILY_HISTORIAN 1 FILE D:\genealogy databases\gedcom files\todd_boyett_connections.ged 1 GEDC 1 CHAR UTF-8 0 @I2@ INDI 1 NAME Jupiter /Flood/ 1 FAMC @F2@ 1 FAMS @F1@ 1 SEX M 0 @I3@ INDI 1 NAME Neptune /Flood/ 1 FAMC @F2@ 1 FAMS @F1@ 1 SEX F 0 @I4@ INDI 1 NAME David /Flood/ 1 SEX M 1 RESI 1 RESI 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @I5@ INDI 1 NAME Samuel F. /Flood/ 1 SEX M 1 BIRT 1 RESI 1 OCCU retired Baptist minister 1 OCCU carpenter 1 RELI First Baptist Church of Meridian 1 RELI Baptist Brotherhood 1 DEAT 1 BURI 1 FAMS @F2@ 1 FAMS @F5@ 1 SOUR @S2@ 1 CHAN 0 @I6@ INDI 1 NAME Nora Naomi /Mills/ 1 SEX F 1 RESI 1 DEAT 1 BURI 1 FAMC @F7@ 1 FAMS @F2@ 1 SOUR @S2@ 1 CHAN 0 @I7@ INDI 1 NAME James /Flood/ 1 SEX M 1 EVEN 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @I8@ INDI 1 NAME Arthur /Flood/ 1 SEX M 1 RESI 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @I9@ INDI 1 NAME Curtis /Flood/ 1 SEX M 1 RESI 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @I10@ INDI 1 NAME Joe /Flood/ 1 SEX M 1 RESI 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @I11@ INDI 1 NAME Elsie /Flood/ 1 SEX F 1 RESI 1 RESI 1 FAMC @F2@ 1 FAMS @F3@ 1 SOUR @S2@ 1 CHAN 0 @I12@ INDI 1 NAME _____ /Mixon/ 1 SEX M 1 FAMS @F3@ 1 SOUR @S2@ 1 CHAN 0 @I13@ INDI 1 NAME Helen /Flood/ 1 SEX F 1 RESI 1 RESI 1 FAMC @F2@ 1 FAMS @F4@ 1 SOUR @S2@ 1 CHAN 0 @I14@ INDI 1 NAME _____ /Creel/ 1 SEX M 1 FAMS @F4@ 1 SOUR @S2@ 1 CHAN 0 @I15@ INDI 1 NAME _____ /_____/ 1 SEX F 1 FAMS @F5@ 1 SOUR @S2@ 1 CHAN 0 @I16@ INDI 1 NAME Annie Mae /Flood/ 1 SEX F 1 RESI 1 RESI 1 FAMC @F5@ 1 FAMS @F6@ 1 SOUR @S2@ 1 SOUR @S2@ 1 CHAN 0 @I17@ INDI 1 NAME _____ /Graham/ 1 SEX M 1 FAMS @F6@ 1 SOUR @S3@ 1 CHAN 0 @I18@ INDI 1 NAME William Milton /Mills/ 1 SEX M 1 OCCU wood dealer 1 DEAT 1 FAMS @F7@ 1 CHAN 0 @I20@ INDI 1 NAME Lilly Ruth /Mills/ 1 SEX F 1 RESI 1 FAMC @F7@ 1 FAMS @F8@ 1 SOUR @S3@ 1 CHAN 0 @I21@ INDI 1 NAME _____ /Reynolds/ 1 SEX M 1 FAMS @F8@ 1 SOUR @S3@ 1 CHAN 0 @I22@ INDI 1 NAME Opal Lee /Mills/ 1 SEX F 1 RESI 1 FAMC @F7@ 1 FAMS @F9@ 1 SOUR @S3@ 
1 CHAN 0 @I23@ INDI 1 NAME _____ /Robinson/ 1 SEX M 1 FAMS @F9@ 1 SOUR @S3@ 1 CHAN 0 @I24@ INDI 1 NAME Lula /Hardin/ 1 SEX F 1 FAMS @F7@ 1 SOUR @S4@ 1 CHAN 0 @I25@ INDI 1 NAME Willie /Mills/ 1 SEX F 1 FAMC @F7@ 1 SOUR @S4@ 1 CHAN 0 @I26@ INDI 1 NAME Woodrow /Mills/ 1 SEX M 1 FAMC @F7@ 1 SOUR @S4@ 1 CHAN 0 @I27@ INDI 1 NAME Anna /Mills/ 1 SEX F 1 FAMC @F7@ 1 SOUR @S4@ 1 CHAN 0 @I28@ INDI 1 NAME Vera /Mills/ 1 SEX F 1 FAMC @F7@ 1 SOUR @S4@ 1 CHAN 0 @I29@ INDI 1 NAME Lora Grace /Flood/ 1 SEX F 1 BIRT 1 RESI 1 DEAT 1 BURI 1 FAMC @F2@ 1 FAMS @F10@ 1 SOUR @S5@ 1 CHAN 0 @I30@ INDI 1 NAME _____ /Smith/ 1 SEX M 1 FAMS @F10@ 1 SOUR @S2@ 1 CHAN 0 @I31@ INDI 1 NAME Bobbie Jean /Smith/ 1 SEX F 1 RESI 1 FAMC @F10@ 1 SOUR @S2@ 1 CHAN 0 @I32@ INDI 1 NAME Grace Patsy /Smith/ 1 SEX F 1 RESI 1 FAMC @F10@ 1 SOUR @S2@ 1 CHAN 0 @I33@ INDI 1 NAME Samuel F. /Flood/ 1 SEX M 1 RESI 1 FAMC @F2@ 1 SOUR @S2@ 1 CHAN 0 @F1@ FAM 0 @F2@ FAM 1 HUSB @I5@ 1 WIFE @I6@ 1 CHIL @I4@ 1 CHIL @I7@ 1 CHIL @I8@ 1 CHIL @I9@ 1 CHIL @I10@ 1 CHIL @I11@ 1 CHIL @I13@ 1 CHIL @I29@ 1 CHIL @I33@ 1 SOUR @S2@ 1 CHAN 0 @F3@ FAM 1 HUSB @I12@ 1 WIFE @I11@ 1 SOUR @S2@ 1 CHAN 0 @F4@ FAM 1 HUSB @I14@ 1 WIFE @I13@ 1 SOUR @S2@ 1 CHAN 0 @F5@ FAM 1 HUSB @I5@ 1 WIFE @I15@ 1 CHIL @I16@ 1 SOUR @S2@ 1 CHAN 0 @F6@ FAM 1 HUSB @I17@ 1 WIFE @I16@ 1 SOUR @S3@ 1 CHAN 0 @F7@ FAM 1 HUSB @I18@ 1 WIFE @I24@ 1 CHIL @I6@ 1 CHIL @I20@ 1 CHIL @I22@ 1 CHIL @I25@ 1 CHIL @I26@ 1 CHIL @I27@ 1 CHIL @I28@ 1 CHAN 0 @F8@ FAM 1 HUSB @I21@ 1 WIFE @I20@ 1 SOUR @S3@ 1 CHAN 0 @F9@ FAM 1 HUSB @I23@ 1 WIFE @I22@ 1 SOUR @S3@ 1 CHAN 0 @F10@ FAM 1 HUSB @I30@ 1 WIFE @I29@ 1 CHIL @I31@ 1 CHIL @I32@ 1 SOUR @S2@ 1 CHAN 0 @S2@ SOUR 1 TITL Biloxi Daily Herald, Gulfport-Biloxi-Mississippi Coast 1 CHAN 0 @S3@ SOUR 1 TITL newspaper clipping 1 CHAN 0 @S4@ SOUR 1 TITL Daniel Thornton 1 CHAN 0 @S5@ SOUR 1 TITL Salem Cemetery, transcriptions by Bennie and Lance White, February 1999 1 CHAN 0 @P1@ _PLAC Meridian, Lauderdale County, Mississippi 1 CHAN 0 @P2@ _PLAC Memorial 
Hospital, Gulfport, Harrison County, Mississippi 1 CHAN 0 @P3@ _PLAC 544 Camp Avenue, Gulfport, Harrison County, Mississippi 1 CHAN 0 @P4@ _PLAC Butler, Choctaw County, Alabama 1 CHAN 0 @P5@ _PLAC San Diego, California 1 CHAN 0 @P6@ _PLAC Gulfport, Harrison County, Mississippi 1 CHAN 0 @P7@ _PLAC Elizabeth City, Pasquotank County, North Carolina 1 CHAN 0 @P8@ _PLAC Bessemer, Jefferson County, Alabama 1 CHAN 0 @P9@ _PLAC Bay Springs, Jasper County, Mississippi 1 CHAN 0 @P11@ _PLAC Salem Methodist Church Cemetery, Quitman, Clarke County, Mississippi 1 NOTE a.k.a. Brashier Community Methodist Church 1 CHAN 0 @P12@ _PLAC Salem Cemetery 1 CHAN 0 @P13@ _PLAC Pass Christian, Harrison County, Mississippi 1 CHAN 0 @P14@ _PLAC Belle Glade, Palm Beach County, Florida 1 CHAN 0 @P15@ _PLAC Moss, Jasper County, Mississippi 1 CHAN 0 @P16@ _PLAC Quitman, Clarke County, Mississippi 1 CHAN 0 @P17@ _PLAC Vossburg, Jasper County, Mississippi 1 CHAN 0 @P18@ _PLAC Montrose, Jasper County, Mississippi 1 CHAN 0 @P19@ _PLAC Detroit, Michigan 1 CHAN
Here's the rudimentary code for this quick test:
# gedcom_concat_test.py

import sqlite3

current_file = "d:/treebard_gps/app/python/test.db"

class GedcomImporter():
    def __init__(self, import_file):
        self.conn = sqlite3.connect(current_file)
        self.cur = self.conn.cursor()
        self.read_gedcom(import_file)
        self.cur.close()
        self.conn.close()

    def read_gedcom(self, file):
        """ The `encoding` parameter in `open()` strips the byte order
            mark from the front of the first line. """
        f = open(file, "r", encoding="utf-8-sig")
        g = open('d:/treebard_gps/app/python/transformed.ged', 'w')
        blank = []
        for line in f.readlines():
            line = line.rstrip("\n")
            if line.startswith("0"):
                if blank:  # skip the empty buffer before the first record
                    new_line = " ".join(blank)
                    g.write('{}\n'.format(new_line))
                blank = []
                blank.append(line)
            elif line.startswith("1"):
                blank.append(line)
            # levels 2+ are ignored in this quick test
        # flush the last buffered record; without this, the final
        # record (TRLR) never gets written -- hence 64 lines, not 65
        if blank:
            g.write('{}\n'.format(" ".join(blank)))
        g.close()
        f.close()

if __name__ == "__main__":
    test_tree = "d:/treebard_gps/etc/todd_boyett_connections_fixed.ged"
    GedcomImporter(test_tree)
|
|
|
Post by Uncle Buddy on May 17, 2022 4:09:51 GMT -8
[Back to normal GEDCOM lines...] The question is how to crawl through these lines and link each succeeding line to the correct previous line, forming branches of hierarchical or nested links. It's been driving me crazy, because the structure is simple to read and understand visually. The nested structure dictated by the number at the front of each line of GEDCOM--a number I call n--is a structure a genealogist would have thought up. It resembles generations of people. But it's not that easy to translate into code.
Or is it?
I think I've finally had a realization of the right kind, the kind that should simplify the code. Here it is in a nutshell.
For each line, there are only two choices--the same two choices every time.
Goal: convert GEDCOM's level numbers into real branches. What to do with each line depends on n, and there are only two possible reactions: either the new line is a child of the previous line, or it is a sibling of a line already seen.
This isn't about the data, it's about structuring the data relationships without going down a rabbit hole. Here are some sample successive lines with the correct response to each listed after the `|||`.
I've had a strong feeling all along that there's a simple way to convert these lines to a real hierarchy, so every time I get bogged down, I just start over. Hopefully the realization that there are only two choices will speed things up. I already know how to get the data into the database, that's the easy part.
|
|
|
Post by Uncle Buddy on May 19, 2022 4:14:49 GMT -8
Well, that was fun, but it would mean detecting nested structures with regex. Which in itself would be OK--not ideal, but doable. But that means looping through every character of every line--long notes, for example--looking for a bracket character or whatever character is chosen to denote a nesting. If that were the only way or the best way, all right, I'd learn how to do it. But it's not.
Nested structuring already exists in lists of lists, lists of dicts, nested dictionaries, etc. Looping over these elements does not involve reading what's in them one character at a time.
So it's back to the original idea of creating an outer collection with an inner collection for each zero line. The inner collections are to house lines numbered 2 thru 99, also with nesting to denote the hierarchy denoted by line numbers in GEDCOM.
No doubt everybody else knows how to do this but me, but I so enjoy not reading other peoples' code, and there is not a tutorial anywhere on earth about how to write a GEDCOM import program. Which is no doubt a conspiracy, but who cares?
Time to do my favorite thing: brainstorm. I like to start with an extreme notion and hack pieces off of it till it's doable and/or practical, then figure out whether it will be practical, flexible, workable, and acceptable to everyone involved.
With those criteria, the only solution is to give up. So my real goal is to have fun and do my best.
The extreme and easiest wrong way to do this would be to start with an empty structure that already has 99 nested empty lists. That would be silly, so the first step might be to read the first character of each GEDCOM line and find out what the deepest nesting is going to be. This could also be done on a per-record basis, so each record has a max_depth value.
But I think a more realistic strategy is something based on this:
pseudocode:

if int(next_n) == int(this_n):
    sibling = True        # next line is a sibling of this line
elif int(next_n) > int(this_n):
    child = True          # next line is subordinate to this line
elif int(next_n) < int(this_n):
    pops_back_up = True   # next line closes this branch and attaches
                          # to an earlier line at its own level
The question is, how to either 1) read through the .ged file twice in parallel, with one reader ahead of the other by one line (wrong), or 2) use instance variables to remember values needed from previous lines (right). Number 2 keeps dipping into the fog and later resurfacing. I'm sure it is the key to everything.
It isn't strictly necessary to gauge depth at all. Eventually there will be a right way that is not so klunky. Assuming this is true, it would be best to not do it wrong first, but instead go straight to the good stuff. Especially since it's morning here and I'm not seeing double yet. Of course my best ideas come late at night when I can no longer see the screen and have to rely on my imagination to come up with ideas that will be obviously nutballs in the morning. It would be more productive to watch B movies on Netflix at night, but I won't do that alone and sometimes my son has something better to do.
So anyhow, let's go straight to the good stuff.
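For what it's worth, one standard shape for this kind of thing is a stack, which handles all the n comparisons (same, bigger, smaller) in a couple of lines without pre-sizing 99 levels. This is a sketch of my own, not Treebard code; the sample lines and the [tag, value, children] node shape are illustrative, and it assumes well-formed input where levels increase by at most one:

```python
# Convert GEDCOM-style numbered lines into a real tree with a stack.
def build_tree(lines):
    """Each node is [tag, value, children]. For level-0 lines the
    xref (e.g. '@I4@') lands in the tag slot, which is fine here."""
    root = ["ROOT", "", []]
    stack = [root]      # stack[n] is the parent for a level-n line
    for line in lines:
        parts = line.split(" ", 2)
        n = int(parts[0])
        node = [parts[1], parts[2] if len(parts) > 2 else "", []]
        del stack[n + 1:]          # level dropped: pop back to the parent
        stack[n][2].append(node)   # attach as a child of the level above
        stack.append(node)         # this node may parent the next line
    return root

tree = build_tree([
    "0 @I4@ INDI",
    "1 NAME David /Flood/",
    "2 SOUR @S2@",
    "1 SEX M",
])
indi = tree[2][0]                        # the single level-0 record
print([child[0] for child in indi[2]])   # ['NAME', 'SEX']
```

The `del stack[n + 1:]` line is the whole trick: it makes the level numbers relative, the way they're meant to be used.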
|
|
|
Post by Uncle Buddy on May 19, 2022 4:15:15 GMT -8
Using instance variables to keep track of hierarchical flip-flops in successive lines of GEDCOM:
line 43 line: 0 @I4@ INDI
line 52 self.prior_level: 1
line 53 self.this_level: 0
line 43 line: 1 NAME David /Todd/
line 52 self.prior_level: 0
line 53 self.this_level: 1
line 43 line: 2 SOUR @S2@
line 52 self.prior_level: 1
line 53 self.this_level: 2
line 43 line: 3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
line 52 self.prior_level: 2
line 53 self.this_level: 3
line 43 line: 1 SEX M
line 52 self.prior_level: 3
line 53 self.this_level: 1
line 43 line: 1 RESI
line 52 self.prior_level: 1
line 53 self.this_level: 1
line 43 line: 2 DATE 1962
line 52 self.prior_level: 1
line 53 self.this_level: 2
But what to do with this useful-looking index? Keeping track of the most recent one-line, zero-line, two-line, etc. is easy. But what are those lines? They're sublists in a list, for example index 0 in self.lines (ignoring the HEAD for now since it doesn't need to be saved in the main list of lines) might be [0, '@i1@', 'INDI'] and index 1 might be [1, 'NAME', 'Andy /Kaufman/'].
Based on prior stabs at it, I could easily create a new person with ID #1 while parsing line 0. Then I'd create a record in the name table of the database while parsing line 1, but to do this I'd have to have a way of remembering what was in line 0, because the name table needs person_id 1 as a foreign key. This isn't very hard, but it gets more complicated, because lines I haven't read yet will need to be linked to the line currently being processed, and I can't remember something I haven't seen yet. And there's the nuisance of needing to create a record in the database while parsing one line, then, while parsing a subsequent line, having to go back and update the new record. It would be more efficient to transact with the database once per line. Similarly, PAGE tags (for citations) are subordinate to SOUR tags (for sources), but they're on different lines, so I have to remember the source_id and use it as a foreign key in the subsequent citation record in the database.
So I can easily access an index which tells me which line the current line is linked to. Say I'm in a line [3, 'PAGE', 'chapter 2']. It's easy to tell that this line will be linked to a line [2, 'SOUR', 'Royal Families of Ancient Texas'], but what do I do with this information?
My current brainstorm is that while parsing the earlier line, I should either be appending new info to it which will be useful to its subordinate lines, or else creating a new list with parallel indexing. Then, the index will give me instructions on what to do with the data in the current line. This seems more efficient than having to parse the old line over again. Or creating separate collections for each GEDCOM record (everything between the zero lines is one record) and then writing a lot of if/elifs to take various actions depending on what the current tag and the referenced tag are. Or creating complex nested structures which, in spite of their complexity, do not correspond to the structure of the target database. I've tried all these things.
For example, assume that data will be input to the database on the fly, and then if the GEDCOM doesn't assign the primary key (names for example), the pk assigned by the database can be saved for use as a foreign key by a subordinate line:
self.lines = [[0, '@I1@', 'INDI'], [1, 'NAME', 'Andy /Kaufman/'], [2, 'SOUR', '@S5@'], [3, 'PAGE', 'chapter 2']]
self.instrux = [{'pk_person': 1}, {'pk_name': 6}, {'fk_source': 5}, {'pk_cite': 17}]
In this system, I don't care whether the PAGE line has a subordinate or not. If it does, the primary key (17) assigned by the database is waiting to be used.
This seems like a workable system, simple and straightforward, but is the info saved enough to do the job? I'll have to look at a more complete section of GEDCOM, to see if this idea is worth a try as is, or if it has to be expanded, or just shipped off to the graveyard-for-cute-but-useless ideas.
|
|
|
Post by Uncle Buddy on May 19, 2022 4:15:45 GMT -8
OK, I tried looking through a more detailed record from a real GEDCOM, and it should work. But I remembered what my objection was to the way SOUR and PAGE tags are arranged in the GEDCOM hierarchy. First let's look at the whole record:

0 @I4@ INDI
1 NAME David /Todd/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
1 SEX M
1 RESI
2 DATE 1962
2 PLAC Meridian, Lauderdale County, Mississippi
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
1 RESI
2 DATE 1962
2 PLAC Gulfport, Harrison County, Mississippi
2 SOUR @S2@
3 PAGE 26 June 1962, page 2, obituary, "Mrs. Lora Grace Jones"
1 RESI
2 DATE 1985
2 PLAC Gulfport, Harrison County, Mississippi
2 SOUR @S3@
3 PAGE "MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
2 SOUR @S4@
3 PAGE findagrave member no. 48611078
1 FAMC @F2@
1 SOUR @S2@
2 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
1 CHAN
2 DATE 17 AUG 2015
3 TIME 20:56:36

The only exception I see to the desired symmetrical, predictable pattern to be applied to each line of GEDCOM is when sources are linked to names in the GEDCOM. There's nothing to prevent developers from modeling GEDCOM when designing the data structure for their app. But Treebard, being a showcase for desirable genieware functionalities, is modeled on how I perceive real world data relationships to work, not on how GEDCOM works. When they do finally cancel GEDCOM, do I want version 17 of my app to still be modeled after GEDCOM? No, so I should not be modeling version 0 on GEDCOM either. Nor version -1.

Sources should be referenced by citations. Names should be referenced by citations. There's no practical reason to link sources to names. So is the new brainstorm flexible enough to instruct a subordinate line of GEDCOM with an indirect link? Here's the solution.
Behold the pertinent lines, with the corresponding self.instrux elements after the `|||`:

1 NAME David /Todd/ ||| {'name_id': 13}
2 SOUR @S2@ ||| {'source_id': 2, 'name_id': 13}
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary ||| {'citation_id': 92}

Based on these instructions, it seems to me like it will be easy for a conditional to say to the citation, "Use the name_id," and the irrelevant source_id will be ignored. The name_id can be saved while processing the source line because the name_id is accessible from the source line. By using a dict, we are being descriptive enough that the way forward should be more or less obvious. I'm gonna try it.

Which is not to say that GEDCOM's way of handling citations isn't one of its worst features. Based on an extremely cursory examination of the web pages below, I'd guess offhand that the only way to improve something so inadequate and wrong is with very complicated solutions. On the other hand, what's needed is a solution that matches reality, without an artifice like GEDCOM sitting between our actual goals and our data.

www.beholdgenealogy.com/blog/?p=1395
fhiso.org/pipermail/sources-citations_fhiso.org/2015-May/000103.html
evidentiasoftware.com/citations-and-the-gedccom/
tng.lythgoes.net/wiki/index.php/GEDCOM
www.fhug.org.uk/forum/viewtopic.php?t=5506
|
|
|
Post by Uncle Buddy on May 19, 2022 4:17:11 GMT -8
I've tested my idea and it works. You can in fact save references to data from previous lines, so that by the time you reach the line that completes a database record, everything needed--including data from previous lines--is at hand. This is the closest I've been able to get to inputting data on the fly for everything. It gets complicated when GEDCOM either doesn't agree with the world on how items of information are related to each other, or doesn't agree with my genieware on that point.
Take citations, for example. I'm working with old GEDCOM in which the names got linked to sources somehow. There's nothing wrong with linking citations to names, and I probably did it on purpose back when I created the tree. But the GEDCOM has got it backwards, linking sources to names (wrong) and citations to sources (right). This is a failing of GEDCOM's line-numbered hierarchy of links. Real links between items of info don't happen in a flat, one-dimensional, simplistic way such as this:
0 @I4@ INDI
1 NAME David /Todd/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
...
0 @S2@ SOUR
1 TITL Biloxi Daily Herald, Gulfport-Biloxi-Mississippi Coast
If GEDCOM were trying to match Treebard (which I'd like to think tries to match the real world), the links might look more like this:
0 @I4@ INDI
...
0 @N98@ NAME
1 INDI @I4@
1 TEXT David Todd/Todd, David
2 TYPE birth name
...
0 @L22@ LINK
1 NAME @N98@
1 CITE @C6@
...
0 @S2@ SOUR
1 TITL Biloxi Daily Herald, Gulfport-Biloxi-Mississippi Coast
...
0 @C6@ CITE
1 TEXT 10 September 1962, page 2, Samuel F. Todd obituary
This points out another shortcoming of GEDCOM. It expects one name per individual, which I corrected to a one-to-many situation where the ONE side of the relationship is represented by a foreign key in the MANY side. Since one person can have many names, the person_id is used in the name table. The GEDCOM "specification" has an off-the-wall suggestion that the first name listed might be the "preferred" name. This exposes GEDCOM for what it is: not a standard and not a specification, it is at best a suggestion.
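The one-to-many shape is simple to show in SQLite (table and column names here are my own guesses for illustration, not Treebard's actual schema):

```python
# One person, many names: person_id is a foreign key in the name table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY);
    CREATE TABLE name (
        name_id   INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(person_id),
        text      TEXT,
        name_type TEXT
    );
""")
conn.execute("INSERT INTO person (person_id) VALUES (1)")
conn.executemany(
    "INSERT INTO name (person_id, text, name_type) VALUES (?, ?, ?)",
    [(1, "Andy Kaufman", "birth name"),
     (1, "Tony Clifton", "alter ego")],
)
rows = conn.execute(
    "SELECT text FROM name WHERE person_id = 1 ORDER BY name_id"
).fetchall()
print(rows)  # [('Andy Kaufman',), ('Tony Clifton',)]
```

No "preferred name goes first" convention needed; a name_type column (or a flag) can say which is which.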
My suggested (off-the-cuff, not final) structure reflects the fact that a citation must be linkable to any number of other elements. In order for this to happen, SQL teaches us that the citation should be saved exactly once and given its own primary key, a unique ID number in its own table. One of GEDCOM's worst shortcomings, and a reason it is so unmalleable, and a reason dev-users are prone to invent a lot of custom tags that others then have to fiddle around with, is that it only has a few zero-level tags: "HEAD", "TRLR", "FAM", "INDI", "OBJE", "NOTE", "REPO", "SOUR", "SUBM" (version 5.5.1).
Now don't get me started. `FAM` shouldn't exist. `HEAD` and `TRLR` are not about the data, so including them as extra records in the file makes the importer of GEDCOM do extra work to ignore them 99% of the time. SUBM is also about the metadata, not the data. But I'll try to stick to the immediate topic.
The LINK tag I've invented above is meant to indicate a many-to-many relationship. Each name can be linked to many citations, and each citation can be linked to many names. Because of this fact of life, Treebard has a database table called `links_links` in which any element can be linked to any other element. And anything that is an element of genealogy has its own table, like name and citation. The zero-level tags in GEDCOM should include all elements. One thing elements can do, it seems, is pop up repeatedly. A single census citation can be linked to dozens of events, names, places... and one source. The source_id is a foreign key in the citation table, so the source foreign key line should not be first. Doing things backwards does not endear me to GEDCOM, no, no, not one little bit. The citation should be created in its own table with a primary key, and why shouldn't the related source_id foreign key be on the same line with the pk, since they go into the same record in the citation table?
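A toy version of that junction-table idea (the real links_links layout isn't shown in this thread, so the columns here are illustrative guesses):

```python
# One citation row, linked to any number of other elements through a
# junction table, with source_id on the same row as the citation's pk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        source_id   INTEGER,          -- fk on the same row as the pk
        text        TEXT
    );
    CREATE TABLE links_links (
        citation_id  INTEGER,
        element_type TEXT,            -- 'name', 'event', 'place', ...
        element_id   INTEGER
    );
""")
conn.execute(
    "INSERT INTO citation VALUES (6, 2, '10 September 1962, page 2')")
conn.executemany(
    "INSERT INTO links_links VALUES (?, ?, ?)",
    [(6, 'name', 98), (6, 'event', 41)],  # one citation, many elements
)
count = conn.execute(
    "SELECT COUNT(*) FROM links_links WHERE citation_id = 6"
).fetchone()[0]
print(count)  # 2
```

The citation text is stored exactly once, however many names, events, and places end up pointing at it.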
I've spent a lot of time in the past few weeks learning more about GEDCOM than I ever thought I would. I find GEDCOM easy to read with my eyes and hard to read with my computer. Well, the computer has no trouble reading it, but the arrangement of the data and the dearth of obviously needed tags is a big problem that never goes away. It's fun for a while, but do I want this to be my new life? GEDCOM, unlike human languages, should be regular, complete, and predictable. When your 10-year-old pocket calculator dies, do you go to a faith healer to try and bring it back to life? No, you replace it; it's cheaper than fixing it.
The upshot of this investigation is that my time might be better spent creating a new GEDCOM-like tool that reflects how databases actually work, as I've tried to show above. By "better spent" I mean something I should be doing instead of trying to figure out how people use GEDCOM. It's clear that people routinely misuse GEDCOM, for the simple reason that GEDCOM is easier to misuse than it is to use. And since it's text and not code, it doesn't complain or refuse to let you keep misusing it. But there's something worse than this: GEDCOM is simplistic, useless twaddle. (Useless as-is... is useless.) It was created to support toy family trees back in the olden days of the personal computer. Maybe they weren't thinking about backwards compatibility back then, just whatever works for now, let's make that our standard, and everyone can conform to us, till hell freezes over. I don't know. I don't want to point my finger; it might get chopped off. But I might want to abandon the quest to get Treebard to import GEDCOM, and instead work on a practical project that could go somewhere, such as replacing GEDCOM.
I've been adamant in suggesting that the correct replacement would be a SQLite database which all genieware providers would commonly use, while retaining their brands' individuality by way of their respective GUIs and feature lists. Well, let me brainstorm a bit if you don't mind.
So... "...the correct replacement would be a SQLite database that all genieware providers would commonly use..." What if that suggestion of mine were incredibly far-fetched, something that will never happen? Or even wrong in some way, I mean impossible?
Well, in that case, my time would be better spent creating a text-file-based replacement for GEDCOM, because that's what people expect and what they seem willing to use.
I'll think about it.
A text-based format would not speed things up much; in fact it might slow things down, since SQL is many times faster than parsing text. But there has to be willingness on the part of everybody to work hard and make sacrifices and even compromises, or the solution will remain forever out of reach, and it will always be someone else's fault. Maybe asking all vendors to adopt the same data structure would be laughed off the table with nary a glance. I really don't know.
Due to my radical lack of programming experience, zero computer science background, and my total lack of political connections in the genealogy community and the genieware industry, I might actually be the right person for the job. If you're gonna do this import/export business with a text file of all things, then someone like me who is under-educated might hit upon something that amateur developers would be able to understand well enough to use correctly. Having downloaded dozens of genieware trials, the nicest thing I can say is that nearly all of them appear to have been created by amateurs.
OK... all of them. Except Genbox, which I like, and Gramps which I want to like. And very few others.
I just took a quick peek at "GEDCOM X" which is not GEDCOM. It appears to be a grandiose abstraction. Hells bells, fellas, genealogy is a hobby for crikey's sake. Who are we trying to fool? I think most developers do not look forward to getting a PhD in whatever replaces GEDCOM. I'm almost sure of it. They want a tool they can grab off the shelf and just use.
I can hear all you lurkers on this forum or blog or whatever it is. I can hear you breathing in the dark, hollering in whispers: "Do it! The job is yours! Make my years of dogged research worth something!"
Well, thanks for the offer. I'll think about it.
So I'm at a crossroads: do I continue beating my head against a tool that was not much good decades ago when it was invented? Or do I instead try my hand at replacing that tool with a better one?
Here are the results from my first successful input of citations into Treebard's database from a real GEDCOM file. In case you don't know what's wrong with this: the citation text should be stored once and only once in the citation table, so the citation_ids in the links_links table should not all be different. There should be only a few of them, repeated.
sqlite> select * from citation;
citation_id|source_id|citations
1|2|10 September 1962, page 2, Samuel F. Todd obituary
2|2|10 September 1962, page 2, Samuel F. Todd obituary
3|2|10 September 1962, page 2, Samuel F. Todd obituary
4|2|10 September 1962, page 2, Samuel F. Todd obituary
5|4|findagrave member no. 48611078
6|4|findagrave member no. 48611078
7|2|10 September 1962, page 2, Samuel F. Todd obituary
8|2|10 September 1962, page 2, Samuel F. Todd obituary
9|2|10 September 1962, page 2, Samuel F. Todd obituary
10|2|10 September 1962, page 2, Samuel F. Todd obituary
11|2|10 September 1962, page 2, Samuel F. Todd obituary
12|2|10 September 1962, page 2, Samuel F. Todd obituary
13|2|10 September 1962, page 2, Samuel F. Todd obituary
14|2|10 September 1962, page 2, Samuel F. Todd obituary
15|2|10 September 1962, page 2, Samuel F. Todd obituary
16|2|10 September 1962, page 2, Samuel F. Todd obituary
17|3|"MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
18|4|findagrave member no. 48611078
19|3|"MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
20|4|findagrave member no. 48611078
21|3|"MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
22|3|"MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
23|4|findagrave member no. 48611078
24|3|"MRS. NORA NAOMI TODD", obituary, date and place of publication unknown
25|4|findagrave member no. 48611078
26|4|findagrave member no. 48611078
27|4|findagrave member no. 48611078
28|4|findagrave member no. 48611078
29|4|findagrave member no. 48611078
30|2|26 June 1962, page 2, obituary, "Mrs. Lora Grace Smith"
31|2|26 June 1962, page 2, obituary, "Mrs. Lora Grace Smith"
32|2|26 June 1962, page 2, obituary, "Mrs. Lora Grace Smith"
33|2|26 June 1962, page 2, obituary, "Mrs. Lora Grace Smith"
34|2|26 June 1962, page 2, obituary, "Mrs. Lora Grace Smith"

sqlite> select * from links_links;
links_links_id|person_id|places_places_id|name_id|source_id|citation_id|finding_id|claim_id|project_id|to_do_id|contact_id|image_id|note_id|role_id|repository_id|report_id|INTEGER|chart_id|media_id
1|||2530||1|||||||||||||
2|||2533||2|||||||||||||
3|||2534||3|||||||||||||
4|||2535||4|||||||||||||
5|||2535||5|||||||||||||
6|||2535||6|||||||||||||
7|||2536||7|||||||||||||
8|||2537||8|||||||||||||
9|||2538||9|||||||||||||
10|||2539||10|||||||||||||
11|||2540||11|||||||||||||
12|||2541||12|||||||||||||
13|||2542||13|||||||||||||
14|||2543||14|||||||||||||
15|||2544||15|||||||||||||
16|||2545||16|||||||||||||
17|||2546||17|||||||||||||
18|||2547||18|||||||||||||
19|||2548||19|||||||||||||
20|||2548||20|||||||||||||
21|||2549||21|||||||||||||
22|||2550||22|||||||||||||
23|||2550||23|||||||||||||
24|||2551||24|||||||||||||
25|||2552||25|||||||||||||
26|||2553||26|||||||||||||
27|||2554||27|||||||||||||
28|||2555||28|||||||||||||
29|||2556||29|||||||||||||
30|||2557||30|||||||||||||
31|||2558||31|||||||||||||
32|||2559||32|||||||||||||
33|||2560||33|||||||||||||
34|||2561||34|||||||||||||
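For the record, here is roughly how the duplication shown above could be avoided. This is a sketch, not Treebard's current code: the UNIQUE constraint is my addition, and it lets the database itself refuse duplicate citation text while the importer reuses the one existing citation_id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# UNIQUE constraint: the same citation text under the same source
# can be stored once and only once.
cur.execute("""
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        source_id INTEGER NOT NULL,
        citations TEXT NOT NULL,
        UNIQUE (source_id, citations))""")

def get_citation_id(source_id, text):
    """Insert the citation only if it is new; either way return its one id."""
    cur.execute("INSERT OR IGNORE INTO citation (source_id, citations) "
                "VALUES (?, ?)", (source_id, text))
    row = cur.execute("SELECT citation_id FROM citation "
                      "WHERE source_id = ? AND citations = ?",
                      (source_id, text)).fetchone()
    return row[0]

obit = "10 September 1962, page 2, Samuel F. Todd obituary"
grave = "findagrave member no. 48611078"
ids = [get_citation_id(2, obit), get_citation_id(4, grave),
       get_citation_id(2, obit), get_citation_id(2, obit)]
print(ids)  # → [1, 2, 1, 1] -- the obituary's id repeats instead of multiplying
```

With this in place, the links_links rows would point at a handful of repeated citation_ids, which is exactly what the dump above should have looked like.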