Some essential dev strategies (GEDCOM import app)
Post by Uncle Buddy on Nov 5, 2023 18:54:41 GMT -8
Having started over on a GEDCOM import program probably two dozen times, I feel that I'm uniquely qualified to have opinions about GEDCOM. But not all my opinions are negative. I don't ONLY play the critic; I've also done the work. My import program is nearly finished for the ~25th(?) time, and here are my notes on how I will proceed with the detail work, now that I have some idea of what the broad strokes must entail for me to create a finishable import app. Which I've theoretically done. Theoretically, all that's left is to fill in a small, medium or large procedure to handle each GEDCOM tag, depending on the tag. The procedures have already been written in various unfinished versions. What stopped me from finishing the last version was that I had glued together the broad strokes, the framework of the machine, from various previous versions, and then tried to fill in the details before I understood how the broad strokes were working together, if at all. The current framework for the app was created more carefully: the broad strokes were put together simply and are well understood. So here are some of the key concepts I plan to employ in finishing this version, if possible.
If not reasonably possible, I'll probably start over again.
The INDI record is the beast, because it contains many sub-records. It occurred to me recently that an inspection of the GEDCOM 5.5.5 or 5.5.1 specs reveals that there are a finite number of sub-record types that can occur within an INDI record. Each of these sub-records is signalled by the occurrence of one of these lines (not counting lines I intend to ignore; my app is open source so more features can be added if you care about the data I intend to ignore):
1 NAME ...
1 EVEN/FACT/BIRT/DEAT/RESI/etc. ...*
1 NOTE ...
1 SEX M/F/U
1 FAMC ...
1 FAMS ...
1 OBJE ...
1 SOUR ...
Below each main sub-record beginning line (a line that starts with a "1"), there can be lines that start with a "2", "3", "4" etc. These lines fall within the current sub-record.
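The grouping described above can be sketched roughly like this. It is only an illustration, not the actual Treebard code: the function names parse_line and split_subrecords are made up for this example, and it assumes the lines passed in are the ones below the "0 @...@ INDI" line of a single record.

```python
def parse_line(line):
    """Split one GEDCOM line into (level, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    tag = parts[1]
    value = parts[2] if len(parts) > 2 else ""
    return level, tag, value


def split_subrecords(indi_lines):
    """Group the lines of one INDI record into sub-records.

    A line at level 1 starts a new sub-record; lines at level 2, 3, 4...
    belong to whichever sub-record was started most recently.
    """
    subrecords = []
    for line in indi_lines:
        level, tag, value = parse_line(line)
        if level == 1:
            subrecords.append([(level, tag, value)])
        elif subrecords:
            subrecords[-1].append((level, tag, value))
    return subrecords


sample = [
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 4 JUL 1850",
    "2 PLAC Paris, Lamar County, Texas, USA",
    "1 SEX M",
]
groups = split_subrecords(sample)
for group in groups:
    print(group[0][1], len(group))
```

Each group can then be handed to whatever procedure handles its first tag.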
The most complex sub-record is the event sub-record. The event has to be assigned an ID, and several elements within the sub-record need to be assigned IDs. By "ID" I mean what SQL calls a primary key, which is basically the identifier that GEDCOM would have assigned to the element itself if GEDCOM had been carefully designed to match the requirements of SQL.
Entering the event sub-record, assign a current event ID by incrementing the maximum existing event ID. This ID is valid until the end of the sub-record is detected by the next occurrence of a line starting with a "1".
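Here is a minimal sketch of that rule using Python's built-in sqlite3 module. The table name event, its columns, and the EVENT_TAGS set are assumptions made up for the example (the real UNIGEDS schema may differ), and the tag set is deliberately incomplete.

```python
import sqlite3

# Partial, illustrative set of event/attribute tags; the real list is longer.
EVENT_TAGS = {"EVEN", "FACT", "BIRT", "DEAT", "RESI"}


def next_event_id(conn):
    """Return one more than the largest event_id, or 1 if the table is empty."""
    row = conn.execute("SELECT MAX(event_id) FROM event").fetchone()
    return (row[0] or 0) + 1


# Hypothetical table standing in for wherever events are stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, event_type TEXT)")
conn.execute("INSERT INTO event VALUES (1, 'BIRT'), (2, 'DEAT')")

lines = ["1 RESI", "2 DATE 1880", "2 PLAC Texas, USA", "1 SEX M"]
current_event_id = None
seen = []
for line in lines:
    level, tag = line.split(" ", 2)[:2]
    if level == "1":
        # Any new level-1 line ends the sub-record that came before it.
        current_event_id = None
        if tag in EVENT_TAGS:
            current_event_id = next_event_id(conn)
            conn.execute(
                "INSERT INTO event VALUES (?, ?)", (current_event_id, tag)
            )
    # Record which event ID, if any, was in force for this line.
    seen.append((tag, current_event_id))

print(seen)
```

The DATE and PLAC lines see the same event ID as the RESI line above them, and the ID is dropped as soon as the SEX line arrives.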
Within the event sub-record, its possible sub-sub-records are signalled by any of these lines:
2 TYPE ...
2 PLAC ...
2 AGE ...
2 DATE ...
2 CAUS ...
2 NOTE ...
2 SOUR ...
2 OBJE ...
2 FAMC ...
The multi-faceted ineptness of the GEDCOM PLAC specification requires a variety of hurdles to be jumped in order to get GEDCOM places into a UNIGEDS database. A place ID has to be created for each single place within the GEDCOM nested place string. A place name ID has to be created for the place names used in the nested place string. And in case a wondrous app like Treebard GPS comes along and plans to provide an autofill field for nested place strings, a nested place string ID has to be created. To do all this without introducing erroneous assumptions--such as wrongly splitting places that are the same place but were called different names at different times, or wrongly lumping together places that have the same name but are not the same place--is a hurdle that can only be jumped by either 1) writing so-called smart software that guesses wrongly most of the time, or 2) stopping the import process, opening a dialog, and asking the user to sift through all the single place names. This can and must be made easy for the user, or the user will cancel the import process. This is one of the broad strokes, and to make it work without messing up the code, it had to be done separately. Mixing this sort of complexity into the final detail work has stopped more than one attempt to write an import program.
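As a rough sketch of the separate place pass, the code below only gathers what the user would need to review. It assigns nested place string IDs, but it deliberately does not assign place IDs or place name IDs, because that is the part that needs the user's answers. The function name collect_places and the data layout are assumptions for illustration only.

```python
def collect_places(plac_values):
    """Gather nested place strings and the single names inside them.

    Returns (nested_ids, single_names):
      nested_ids   maps each normalized nested string to a new ID
      single_names maps each single name to the nested strings it occurs in
    """
    nested_ids = {}
    single_names = {}
    for value in plac_values:
        # Normalize spacing and drop empty parts such as "Paris,,Texas".
        parts = [p.strip() for p in value.split(",") if p.strip()]
        nested = ", ".join(parts)
        if nested not in nested_ids:
            nested_ids[nested] = len(nested_ids) + 1
        for name in parts:
            contexts = single_names.setdefault(name, [])
            if nested not in contexts:
                contexts.append(nested)
    return nested_ids, single_names


plac_values = [
    "Paris, Lamar County, Texas, USA",
    "Paris, Ile-de-France, France",
    "Paris,Lamar County,Texas,USA",
]
nested_ids, single_names = collect_places(plac_values)

# Names found in more than one nested string are the ones the user must
# be asked about: same place, or a different place with the same name?
ambiguous = sorted(n for n, ctx in single_names.items() if len(ctx) > 1)
print(ambiguous)
```

Nothing here guesses whether the two Parises are the same place; it only lines up the questions for the dialog.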
The FAMC bag of worms is one I prefer to not discuss today, as it is particularly grotesque. I'll try to describe the right way to deal with FAMC/FAMS/HUSB/WIFE/CHIL soon, when I'm sitting in the pot of gold at the end of the rainbow with a successful procedure to show off. But let me just take this opportunity to mention that one of the broad strokes is to read through the whole .ged file and record all the primary identifiers first, before starting over and reading the non-zero lines. As a text file, GEDCOM's left hand doesn't know what its right hand is doing, so I have never seen any simple solution other than to save the primary IDs first. So, for example, the family ID will already be saved when you come across a reference to it in a second read-through.
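The first read-through might look something like the sketch below. The function name collect_record_ids and the idea of numbering each record type from 1 are assumptions for the example; the point is only that every level-0 identifier is known before any reference to it is processed.

```python
import re

# Matches level-0 lines that carry an identifier, e.g. "0 @F1@ FAM".
ZERO_LINE = re.compile(r"^0 @([^@]+)@ (\w+)")


def collect_record_ids(lines):
    """First pass: map each GEDCOM identifier to a new integer ID, per tag."""
    ids = {}
    for line in lines:
        match = ZERO_LINE.match(line)
        if match:
            xref, tag = match.groups()
            by_tag = ids.setdefault(tag, {})
            by_tag[xref] = len(by_tag) + 1
    return ids


ged_lines = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 FAMS @F1@",
    "0 @F1@ FAM",
    "1 HUSB @I1@",
    "0 TRLR",
]
ids = collect_record_ids(ged_lines)

# Second pass: the FAMS line refers to @F1@ before the FAM record itself
# appears in the file, but its ID has already been saved.
family_id = ids["FAM"]["F1"]
print(ids, family_id)
```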
Another broad stroke I like to do first is to handle concatenations, keyed to the line number where the concatenated text began. Because CONC and CONT lines work like nothing else in GEDCOM, it lightens up the code-reading load considerably to not mix their handling in with the rest of the code.
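A minimal sketch of that pre-pass follows. It assumes the usual GEDCOM meaning of the two tags (CONC appends text directly, CONT starts a new line first); the function name collect_concatenations is made up for the example.

```python
def collect_concatenations(lines):
    """Join CONC/CONT text, keyed to the line number where the text began.

    Line numbers are 1-based. Only lines that were actually continued
    end up in the returned dict.
    """
    texts = {}
    start, start_value = None, ""
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\r\n").split(" ", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        tag = parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if tag in ("CONC", "CONT") and start is not None:
            joiner = "" if tag == "CONC" else "\n"
            texts[start] = texts.get(start, start_value) + joiner + value
        else:
            # Any other line is a possible starting point for continued text.
            start, start_value = number, value
    return texts


ged_lines = [
    "1 NOTE This is a long no",
    "2 CONC te that was split.",
    "2 CONT Second line.",
    "1 SEX M",
]
texts = collect_concatenations(ged_lines)
print(texts)
```

When the main pass later reaches the NOTE on line 1, it can look up the full text by line number and skip the CONC and CONT lines entirely.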
As for handling the sub-records of the event sub-record, it's a complex hurdle to jump because the event contains so many elements that need an ID, and the relationship of these elements to each other--their cardinality--has to be understood in order to store them so that the data will just slide into UNIGEDS effortlessly without jumping more hurdles. But the main point of this whole chapter is simple: within a sub-record, assign an ID where needed and then (Python-wise) save that ID as an instance variable such as self.current_event_id, self.current_citation_id, self.current_source_id, etc. Why? That's the whole point: so you will have access to that value in the next line, and the next, until that sub-record is ended by the beginning of a new sub-record.
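To make the instance-variable idea concrete, here is a small sketch. The class name IndiImporter, the rows list, and the handful of tags it handles are all assumptions for illustration; a real importer would write to the database and cover far more tags.

```python
class IndiImporter:
    # Partial, illustrative set of event/attribute tags.
    EVENT_TAGS = {"EVEN", "FACT", "BIRT", "DEAT", "RESI"}

    def __init__(self):
        self.max_event_id = 0
        self.max_citation_id = 0
        self.current_event_id = None
        self.current_citation_id = None
        self.rows = []  # stand-in for database inserts

    def handle_line(self, line):
        parts = line.strip().split(" ", 2)
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 1:
            # A new sub-record begins, so the old IDs no longer apply.
            self.current_event_id = None
            self.current_citation_id = None
            if tag in self.EVENT_TAGS:
                self.max_event_id += 1
                self.current_event_id = self.max_event_id
                self.rows.append(("event", self.current_event_id, tag))
        elif self.current_event_id is not None:
            if level == 2:
                # A new level-2 line ends any citation in progress.
                self.current_citation_id = None
                if tag == "SOUR":
                    self.max_citation_id += 1
                    self.current_citation_id = self.max_citation_id
                    self.rows.append(
                        ("citation", self.current_citation_id,
                         self.current_event_id, value)
                    )
                elif tag in ("DATE", "PLAC"):
                    self.rows.append((tag.lower(), self.current_event_id, value))
            elif (level == 3 and tag == "PAGE"
                    and self.current_citation_id is not None):
                self.rows.append(("page", self.current_citation_id, value))


importer = IndiImporter()
for line in [
    "1 BIRT",
    "2 DATE 4 JUL 1850",
    "2 SOUR @S1@",
    "3 PAGE p. 12",
    "1 DEAT",
    "2 DATE 1910",
]:
    importer.handle_line(line)
print(importer.rows)
```

The DATE line never has to look backward for its event, and the PAGE line never has to look backward for its citation; the IDs are simply still sitting on self.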
If I try to go into much more detail here, I'll use my fresh morning energy to blather and dither and froth at the mouth instead of writing the code itself.
It's going well, here at Treebard University, but I had all the windows removed in order to minimize damage when I throw furniture through them.
* (RE: putting EVEN and FACT (for example) in the same category... While it's true that events and attributes are not the same thing exactly, this doesn't mean they shouldn't be lumped together. In UNIGEDS they are lumped together, not because they are the same thing, but because they should be lumped together. Trying to keep them separate promotes application bloat, will not be understood nor appreciated in the same way by different genieware creators, and serves no practical purpose. Family Historian does something similar with its "Fact" element. For an example of pointlessly harping on the theme "events and attributes are not the same", see "Attributes do not have Age" in the 5.5.5 specs.)