Post by Uncle Buddy on May 9, 2022 8:15:50 GMT -8
In my current coding project meant to get GEDCOM data imported into Treebard, I've already gotten the GEDCOM pointers tagged HUSB, WIFE, and CHIL into the database. But I noticed that this only input four couples with children, while there are nine FAM tags in the .ged file.
sqlite> select person_id1, person_id2 from finding where kin_type_id1 is not null or kin_type_id2 is not null;
5|6
18|24
5|6
5|6
5|6
5|6
5|6
5|6
5|15
18|24
18|24
18|24
18|24
18|24
18|24
5|6
30|29
30|29
5|6
From above you see there are four unique sets of parents in the tree. Where does GEDCOM get 9 families from? (That's 9 x 2 = 18 FAMS tags for "spouses".)
From the .ged:
0 @F2@ FAM
1 HUSB @I5@
1 WIFE @I6@
0 @F3@ FAM
1 HUSB @I12@
1 WIFE @I11@
0 @F4@ FAM
1 HUSB @I14@
1 WIFE @I13@
0 @F5@ FAM
1 HUSB @I5@
1 WIFE @I15@
0 @F6@ FAM
1 HUSB @I17@
1 WIFE @I16@
0 @F7@ FAM
1 HUSB @I18@
1 WIFE @I24@
0 @F8@ FAM
1 HUSB @I21@
1 WIFE @I20@
0 @F9@ FAM
1 HUSB @I23@
1 WIFE @I22@
0 @F10@ FAM
1 HUSB @I30@
1 WIFE @I29@
From the set of families found by the SQL query, these .ged couples are missing: 12/11, 14/13, 17/16, 21/20, 23/22.
sqlite> select * from finding where person_id1 in (12, 11, 13, 14, 17, 16, 21, 20, 22, 23) or person_id2 in (12, 11, 13, 14, 17, 16, 21, 20, 22, 23);
sqlite> [no results]
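The same comparison can be sketched as a set difference in Python. This is only an illustration: the variable names are made up, and the pairs are copied by hand from the FAM records and the first query's output above.

```python
# Couples listed in the FAM records of the .ged file, as (HUSB, WIFE).
fam_couples = {
    (5, 6), (12, 11), (14, 13), (5, 15), (17, 16),
    (18, 24), (21, 20), (23, 22), (30, 29),
}

# Distinct (person_id1, person_id2) pairs returned by the first query.
db_couples = {(5, 6), (18, 24), (5, 15), (30, 29)}

# Set difference: couples the GEDCOM lists that never reached the database.
missing = sorted(fam_couples - db_couples)
print(missing)
```

Nine FAM couples minus the four in the database leaves the five couples listed above.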
These are childless couples. Which reveals that GEDCOM has more than one way of deciding that a FAM tag should be created. But actually it's not GEDCOM that makes the decision. GEDCOM just transfers the decisions of one genieware to another genieware. I say "the decisions of the genieware" because in this case, it's the arrangement of the GUI that caused the creation of the families. By inputting two people into spouse fields in the GUI, a couple is born. They have no common events such as marriage or divorce; no events at all. They have no children. The GUI decided. This is not unreasonable. The purpose of the spouse inputs is clear. It allows the user to perform purely conclusion-based genealogy. Not only does the user not have to show any sources for the relationship of the partners, he also doesn't have to show any couple events or children.
So is this the real purpose of the GEDCOM `FAM` tag? To allow and encourage something even more flimsy than unsourced, conclusion-based genealogy? Something that I'd call baseless or even GUI-based genealogy. This is pretty much the same thing as conclusion-based, unsourced data entry. I just got a shock, and here's the cause of my surprise: there is in fact a method to GEDCOM's madness. If I'm correct, the FAM tag allows genieware providers to partner couples up without the user declaring any event to describe the relationship. No event, no source, no reason to say it, just the juxtaposition of two input fields in a GUI. This is where both GEDCOM and many geniewares just assume spousal roles without the user's explicit permission.
Is this the right way to reflect the nature of the human world? Actually there are usually many undocumented events such as "first kiss" in the relationship leading up to the official event-asserting sources such as marriage licenses and birth certificates, and other documentable events such as having the same address in common year after year.
As a self-styled GUI creator, I have sweated blood designing a family table that shows all of the current person's partners and children at the same time. This is unique to Treebard or nearly so.
But you can't just input a partner to Treebard. The GUI's family table is populated with partners by the code's asking the database for marital events and/or children. This family table would have been so much easier to design if I'd done what the other developers have done. In Treebard you don't enter a partner first, you enter a partner event. Then the partner input appears, and you can enter a name. This is how I'm trying to create genieware that reflects the world and a database structure that reflects how relationships actually work. Actual human relations occur because one or more events add up to a relationship, not some researcher a hundred years later sitting at a computer copying someone's GEDCOM and pumping it into his own tree without doing any research, and not even because ancestry.com says it happened. Events and attributes create relationships in the real world, not some over-simplified database represented by pseudo-elements such as the GEDCOM `FAM` tag.
My conclusion is that the redundancy built into the GEDCOM structure because of the FAM, FAMC and FAMS tags exists for no better reason than to support the convenience of software developers who don't want to do the hard work of representing the real world accurately in a database structure. I don't believe that enforcing sourcing is practical. But is it too much to ask that the user should declare some reason for saying that two people were partners?
The reason this sudden insight about GEDCOM's FAM tag causes me to recoil in horror is that it could conceivably force me to create a family table with family IDs in my database, just so I can import unsourced, eventless, childless, baseless relationships, just because other geniewares allow it.
But I can't see being forced to do this. There are no relationships that can't be described. In Treebard, you can create a marriage or partnership of some kind just by saying so, but you do have to say so. That is, you have to describe the partnership by choosing an event type or creating your own. Then the input field appears for the partner. You could even create a generic event type "relationship" if I forget to do it, for those cases where nothing specific is known. Then no one will have anything to complain about. Because it's not harder to do it this way, it just has to be done in this order: event first, partner next. Source whenever you want.
For the sake of consistency, Treebard could go one of two ways. Any FAMS tags not supported by an event or a child could be put into the exception report with instructions for the user to re-enter the event of a relationship as an event. Marital events would create the needed input for the name. These events include marriage, wedding, divorce, annulment, separation, partnership, common-law marriage, and whatever marital event types the user wants to create himself. Only one such event, or one child, is enough to get the partner input to appear so the user can enter the partner's name.
The other strategy is probably better, since the existence of the relationship is not redundant. Redundancy isn't the only issue. Five of the FAMS tags in the .ged file I'm using aren't redundant at all, but that's because they're baseless. Treebard could create the generic "relationship" event itself and then go ahead and import the couple's link.
Treebard's extra step of creating the event first is not more complicated than the usual baseless way of entering partnerships. Because oddly enough, the real way to complicate a tree element, philosophically speaking, is to oversimplify it. From the moment an oversimplified element starts floating around in a tree, ungrounded by any tie to reality, it causes confusion and uncertainty by raising questions in the user's mind every time he sees it. Then comes the real complication caused by over-simplification: all the genieware vendors do it that way because once upon a time, someone designed GEDCOM to encourage baseless couplings of individuals with that weird FAMS tag. Not that fixing GEDCOM is my goal, but it seems the participants in an event are subordinate to the event. You don't just imagine relationships into existence.
When I see a couple in a genieware unsupported by any events, children, or sources, I have to ask myself whether the genieware user was doing research or just filling in names and dates as quickly as possible to get a big tree as fast as they could. To each his own, I guess, but for me, if genealogy doesn't tell a story about human beings doing things in their world, then why bother? At that point, genealogy seems like just a compulsive sort of twitch, no more interesting than reading the phone book.
But who am I to tell anyone how to do their hobby. Some time ago, I gave up on the idea of forcing users to source their conclusions. But I can't keep quiet about flying in the face of standard programming common-sense by adding elements to the database structure which are not elements.
And I don't mind being wrong. If I am wrong, then someone with more experience at GEDCOM can tell me what the FAMS tag actually accomplishes, what it does right.
Post by Uncle Buddy on May 9, 2022 9:57:58 GMT -8
sqlite> select event_type_id, event_types from event_type where couple = 1;
event_type_id|event_types
2|marriage
11|wedding
15|divorce
16|engagement
17|annulment
18|separation
56|betrothal
59|marriage license
81|marriage banns
85|filing for divorce
90|partnership
93|anniversary celebration
98|marriage contract
102|cohabitation
103|living together
104|wedding anniversary
105|shacking up
sqlite> select event_type_id, event_types from event_type where marital = 1;
event_type_id|event_types
2|marriage
11|wedding
15|divorce
17|annulment
18|separation
59|marriage license
85|filing for divorce
90|partnership
93|anniversary celebration
98|marriage contract
102|cohabitation
103|living together
104|wedding anniversary
105|shacking up
sqlite> select event_type_id, event_types from event_type where hidden = 1;
event_type_id|event_types
45|tobacco use
46|alcohol use
47|drug problem
105|shacking up
sqlite> select * from event_type where event_type_id = 1;
event_type_id|event_types|built_in|hidden|couple|after_death|marital
1|birth|1|0|0|0|0
sqlite> insert into event_type values (50, 'relationship', 1, 0, 1, 0, 0);
I went ahead and added an event type "relationship" but it's not going to work as a generic marital event since it doesn't imply marriage or marriage-like circumstances in the way that "cohabitation" does. Right now I'm thinking that the couple relationships indicated only by a FAMS tag in a GEDCOM file should be input as a "marriage". It's probably what people expect. The user can change the data after it's imported.
Treebard's event types have subtypes as shown above in the boolean columns `built_in`, `hidden`, `couple`, `after_death`, and `marital`. True is 1 and False is 0, in boolean columns. I'll describe these sub-types.
Built_in: event types that come with Treebard and can't be deleted.
Hidden: any event type, whether built-in (can't be deleted) or user-defined (can be deleted if it's not being used in the tree), can be hidden by the user so that it doesn't appear in the dropdown list the user selects from. ("Shacking up", for example, is there to test the `hidden` functionality.)
Couple: this includes marital events but also other couple events that don't imply or resemble marriage. Used to get things right when gathering data on load that will influence how the data will display in the current person tab.
Marital: a sub-type of couple event which implies or resembles marriage. Even one of these linked to a person will add an input to the person's GUI family table to display/edit/add the name of a partner. The distinction between couple events and marital events is that only marital events will show a partner in the GUI's family table.
After_death: events that should be listed below death on the events table instead of sorted by date, even if some of the dates are wrong.
When the user creates an event type, he'll have to tell Treebard which of these sub-sets of event_type the new event_type should belong to.
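Here is a minimal sketch of how those boolean columns might be queried, using Python's sqlite3 module and an in-memory database. The column names come from the query output above; the column types, the function names, and the `built_in` and `after_death` values for most rows are assumptions made for illustration (only the "birth" and "relationship" rows are shown in full above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE event_type (
        event_type_id INTEGER PRIMARY KEY,
        event_types TEXT,
        built_in INTEGER, hidden INTEGER, couple INTEGER,
        after_death INTEGER, marital INTEGER)"""
)
# built_in and after_death are placeholder values (assumptions) except for
# 'birth' and 'relationship', whose full rows appear in the session above.
rows = [
    (1, "birth", 1, 0, 0, 0, 0),
    (2, "marriage", 1, 0, 1, 0, 1),
    (16, "engagement", 1, 0, 1, 0, 0),
    (50, "relationship", 1, 0, 1, 0, 0),
    (105, "shacking up", 1, 1, 1, 0, 1),
]
conn.executemany("INSERT INTO event_type VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def visible_marital_types(conn):
    """Marital event types that are not hidden from the dropdown."""
    cur = conn.execute(
        "SELECT event_types FROM event_type "
        "WHERE marital = 1 AND hidden = 0 ORDER BY event_type_id"
    )
    return [row[0] for row in cur]


def couple_only_types(conn):
    """Couple events that don't imply marriage, so no partner input appears."""
    cur = conn.execute(
        "SELECT event_types FROM event_type "
        "WHERE couple = 1 AND marital = 0 ORDER BY event_type_id"
    )
    return [row[0] for row in cur]


print(visible_marital_types(conn))
print(couple_only_types(conn))
```

With these rows, "engagement" and "relationship" count as couple events but would not add a partner to the family table, which is the distinction described above.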
You might ask whether allowing the user to create types isn't the same thing as letting genieware vendors create custom GEDCOM tags when writing GEDCOM files for export. I don't think so. Treebard is not made to be exported with GEDCOM; the user is advised to share the whole program. We'll do our best to export Treebard data with GEDCOM since this service is needed, but for new trees and/or new genealogists who haven't made big trees that would have to be done over, it might make sense to just use Treebard for everything and share the whole program. With users sharing via a common database structure, the import process can add new records to type tables as needed.
Hiding and unhiding types, and creating and deleting types will be done in the types tab.
Post by Uncle Buddy on May 11, 2022 1:02:55 GMT -8
I was wrong about FAMS, it is in fact completely redundant and serves no purpose. It isn't needed to denote childless couples.
The FAM tag can have HUSB, WIFE and CHIL tags, and if there is no CHIL tag, then it's a childless couple. The reason I thought FAMS pointed out childless couples was that in my code, I had forgotten to input childless couples in the FAM tag. Once I fixed that, everyone was present and accounted for.
The only purpose I can find for the FAMS tag is to create a redundant situation where the person writing the code to import the GEDCOM has to make sure the redundancy was dealt with consistently. For example, if a childless couple is not listed with HUSB/WIFE tags in the FAM record, but is listed in the INDI record with FAMS tags, then the couple has to be input and/or an exception noted.
Having to check two places for the same thing is [adjective deleted]. Having to do this extra work is a function of the redundant FAM tag. What I said above still holds true in general: it's the FAM tag redundancy (the very fact that `FAM` exists) which supports and encourages baseless couples, i.e. couples that are baseless conclusions of the software design wherein the juxtaposition of partner input fields gets the user to slide two people into his tree without an event or a child to link them together as a couple. I could rewrite Treebard to do it this way, but if I did that, then I'd deserve the FAM tag's extra work load.
The FAM tag should not exist, and it should not be possible to say two people are a couple without an event or a child. This is not about sourcing, it's about baseless genealogy. Imaginary couples.
I'm not complaining about having to do the work, because I don't have to do anything. But I am complaining; that's kinda my hobby, because if I'm gonna make genieware better, I have to figure out what's wrong with it first.
And I'm still aware that I could be wrong in some practical way that actually matters, but in database design, as far as I know, there has to be an awfully good reason to input the same data twice, and I haven't found any excuse yet to require this act of denormalization.
Post by Uncle Buddy on May 12, 2022 22:03:49 GMT -8
All complaints aside, I've decided to figure out the basics of how to import GEDCOM come hell or high water, and the FAM tag has essential data stored in its subordinate lines even if the FAM tag itself is the wrong way to get this done. And it's possible for a genieware app to use either of the ends of the double-pointed arrow instead of shooting the arrow both directions as the creators of GEDCOM expect it to be done.
For example, the export program is supposed to put pointers to individuals in HUSB/WIFE/CHIL lines in FAM records in addition to adding FAMC/FAMS lines to INDI records. The import program needs a simple way to get all the data if this is not done the way GEDCOM expects it to be done, for example, in case FAMC/FAMS tags are added to INDI records instead of HUSB/WIFE/CHIL lines being added to FAM records.
I already tried doing this the hard way, which is good because I needed to get my rotten first start tangled to the point where I was willing to start over from scratch on the whole module. The fresh attempt is going well and the code is now easy to read, simpler and more succinct.
With a usable code base it has also occurred to me that a Python dictionary might have the right stuff for dealing with this pseudo-family element quandary in a simple, straightforward way.
INDI tags seemingly have to be input first since FAM tags make reference to them with pointers. Now here's where the redundancy I keep whinging about twists me into a pretzel. While attempting to input INDI tags we come across subordinate FAMC and FAMS tags in the INDI records which point to FAM tag identifiers.
0 @I4@ INDI
1 NAME David /Flood/
1 FAMC @F2@
0 @I5@ INDI
1 NAME Samuel F. /Flood/
1 FAMS @F2@
1 FAMS @F5@
0 @I6@ INDI
1 NAME Nora Naomi /Mills/
1 FAMC @F7@
1 FAMS @F2@
...
0 @F2@ FAM
1 HUSB @I5@
1 WIFE @I6@
1 CHIL @I4@
The FAMC and FAMS data can't be input to the database yet because these lines reference family IDs and the FAM tags haven't even been seen yet. So the pointers listed as FAMC and FAMS data while parsing INDI tags have to be stored in a Python list or something. Then later on when parsing the FAM strings, the import program has to add the individuals mentioned by the FAMC and FAMS pointers if and only if the information was not also included as pointers in the FAM records. Since it's a double-pointed arrow, shooting it in either direction first seems wrong. There's a lot of truth behind such an intuition.
I said Python list above, but a dictionary might be just what the doctor ordered. Inputting redundant data to the dict will overwrite what's already there without a whimper, while inputting unique data will not overwrite anything. So we'll save the good stuff as we stumble over it, without stopping to fuss around about what is and is not redundant. Here is a blank dict and some specific goals for the final code, whatever that will look like:
(pseudocode)
self.families = {INDI: [{CHIL: [], HUSB: "", WIFE: ""}]}

In FAMC line, get pointer and append to CHIL value of key individual in self.families.
In FAMS line, get pointer and assign to HUSB or WIFE value of key individual.

Resulting in this after reading the INDI records:
self.families = {
    INDI: [{CHIL: [], HUSB: "", WIFE: ""}],
    INDI: [{CHIL: [], HUSB: "", WIFE: ""}],
}
A little voice is telling me that on closer inspection, the two ends of the double-pointed arrow have to be reconciled somehow so that when we get to the FAM tag, the data goes into the right family. The FAM pointers might be used for that, but only because the information has been duplicated and stored in two different places. (See below.)
The above is preliminary. I won't bother you with the rest of the conjecture, but this train of thought will lead to a functional dict for storing data accurately, completely, and non-redundantly, so that when the FAM, FAMC, and FAMS tags' data have all been incorporated into the dict, it will be a simple matter to input everything to the database with a few `INSERT` and/or `UPDATE` statements.
Just for completeness, this is how GEDCOM should record this family, non-redundantly, straightforwardly, in order to mesh nicely with the universal goal of programmers to not do the same thing twice nor make complex tasks more complicated than they already are:
0 @I4@ INDI
1 NAME David /Flood/
1 PARENTS @I5@, @I6@
0 @I5@ INDI
1 NAME Samuel F. /Flood/
1 PARTNER @I6@
1 PARTNER @I15@
0 @I6@ INDI
1 NAME Nora Naomi /Mills/
1 PARENTS @I18@, @I24@
1 PARTNER @I5@
The above conveys more information in eleven lines of GEDCOM and four tags (INDI, NAME, PARENTS, PARTNER) than the official version conveys in fifteen lines and eight tags (INDI, NAME, FAM, FAMC, FAMS, HUSB, WIFE, CHIL).
Unlike the nuclear family model of reality supported, reflected and enforced by GEDCOM, the latter rendition avoids the following mistakes fostered by GEDCOM's structure:
--Having children together doesn't mean you're husband and wife.
--There are no compound elements (for the code to deal with), e.g. "couple" or "family", just the one that's needed: "individual".
--If I say who your parents are, I don't have to also remember to say you are their child.
Nothing is missing from the latter concise, non-redundant description of the same people and their relationships. The CHIL and FAM tags are not necessary. I'm not a mathematician, but I'd guess that having to do the same thing twice in programming results in four times the amount of code and goodness knows how many times more work to keep that code from starting the next world war by accident.
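To show that the individual-only rendition carries the same links, here is a rough sketch that derives the PARENTS and PARTNER values from FAM-style records. PARENTS and PARTNER are the hypothetical tags proposed above, not real GEDCOM, and the function and variable names are invented for the example. The pointers are the ones from the sample file.

```python
from collections import defaultdict

# FAM-style records using pointers from the sample above.
fams = {
    "@F2@": {"HUSB": "@I5@", "WIFE": "@I6@", "CHIL": ["@I4@"]},
    "@F5@": {"HUSB": "@I5@", "WIFE": "@I15@"},
    "@F7@": {"HUSB": "@I18@", "WIFE": "@I24@", "CHIL": ["@I6@"]},
}


def to_individual_links(fams):
    """Turn family records into per-person PARENTS and PARTNER values."""
    parents = {}
    partners = defaultdict(list)
    for fam in fams.values():
        couple = [p for p in (fam.get("HUSB"), fam.get("WIFE")) if p]
        if len(couple) == 2:
            partners[couple[0]].append(couple[1])
            partners[couple[1]].append(couple[0])
        for child in fam.get("CHIL", []):
            parents[child] = tuple(couple)
    return parents, dict(partners)


parents, partners = to_individual_links(fams)
print(parents)
print(partners["@I5@"])
```

The result matches the eleven-line version: David's parents are @I5@ and @I6@, Samuel's partners are @I6@ and @I15@, and Nora's parents are @I18@ and @I24@.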
Believe it or not, rewriting the GEDCOM as above actually tempted me to think about trying to rewrite all of GEDCOM. This is the sneaky appeal of GEDCOM which has snared many a well-meaning genealogist into becoming a GEDCOM savant: fiddling with GEDCOM is fun. It seems so accessible, since it's human-readable. Let's just toy with GEDCOM, why don't we, instead of tracking down all those pesky ancestors. After all, recording our findings is hardly worth the effort since GEDCOM still hasn't been replaced, and as we all know, whether we trumpet it or not, GEDCOM files aren't worth the pixels they're printed on.
But the main thing we have to remember is that SQL was only a few years old when GEDCOM was born, and if not for this unfortunate circumstance, GEDCOM would not have had us all laboring down a blind alley all these years.
So much for not complaining, but I'm only human.
Post by Uncle Buddy on May 12, 2022 23:21:13 GMT -8
The Laws of Importing GEDCOM
1) Good GEDCOM is bad.
2) Use the way GEDCOM is structured; it's faster than restructuring it correctly before using the data.
Here's the right way, based on the brainstorming in the last post.
0 @I4@ INDI
1 NAME David /Flood/
1 FAMC @F2@
0 @I5@ INDI
1 NAME Samuel F. /Flood/
1 FAMS @F2@
1 FAMS @F5@
0 @I6@ INDI
1 NAME Nora Naomi /Mills/
1 FAMC @F7@
1 FAMS @F2@
0 @I15@ INDI
1 NAME _____ /_____/
1 FAMS @F5@
0 @I16@ INDI
1 NAME Annie Mae /Flood/
1 FAMC @F5@
1 FAMS @F6@
...
0 @F2@ FAM
1 HUSB @I5@
1 WIFE @I6@
1 CHIL @I4@
0 @F5@ FAM
1 HUSB @I5@
1 WIFE @I15@
1 CHIL @I16@
0 @F6@ FAM
1 HUSB @I17@
1 WIFE @I16@
famtags = {"@F2@", "@F5@", "@F7@", "@F6@"}

model = {"HUSB": None, "WIFE": None, "CHIL": []}
self.families = {}

self.families = {
    "@F2@": {"HUSB": "@I5@", "WIFE": "@I6@", "CHIL": ["@I4@"]},
    "@F5@": {"HUSB": "@I5@", "WIFE": "@I15@", "CHIL": ["@I16@"]},
    "@F6@": {"HUSB": None, "WIFE": "@I16@", "CHIL": []},
    "@F7@": {"HUSB": None, "WIFE": None, "CHIL": ["@I6@"]},
}
When reading lines of GEDCOM:
1) Extract unique values for pointers FAMC and FAMS; for each pointer in famtags, add a copy of `model` to the main dict and add the INDI identifier to the values of the inner dicts.
2) INDI pk goes into the CHIL list according to the FAMC fk.
3) FAM records should be complete before reading those lines, but if not, just add whatever was not included redundantly by the export program that created the GEDCOM file.
Once all the INDI records have been read, the dict will already be complete. When the FAM records are read, the data they hold will go straight into the dict, replacing x with x and y with y. Any information that's only in one set of data will be added, and the result is a complete collection of data without ever checking to see if something is already in the dict or not.
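The FAM-reading step might look something like the sketch below, which starts from the dict as it stands after the INDI records and folds in the FAM lines from the sample. The function name and the plain `families` variable (standing in for `self.families`) are assumptions. One caveat: because CHIL is a list here, this sketch does make one membership check to avoid duplicate children, so "never checking" holds only for HUSB and WIFE.

```python
import copy

model = {"HUSB": None, "WIFE": None, "CHIL": []}

# State after the INDI records, as in the dict shown above.
families = {
    "@F2@": {"HUSB": "@I5@", "WIFE": "@I6@", "CHIL": ["@I4@"]},
    "@F5@": {"HUSB": "@I5@", "WIFE": "@I15@", "CHIL": ["@I16@"]},
    "@F6@": {"HUSB": None, "WIFE": "@I16@", "CHIL": []},
    "@F7@": {"HUSB": None, "WIFE": None, "CHIL": ["@I6@"]},
}

fam_lines = [
    "0 @F2@ FAM", "1 HUSB @I5@", "1 WIFE @I6@", "1 CHIL @I4@",
    "0 @F5@ FAM", "1 HUSB @I5@", "1 WIFE @I15@", "1 CHIL @I16@",
    "0 @F6@ FAM", "1 HUSB @I17@", "1 WIFE @I16@",
]


def read_fam_lines(lines, families):
    """Fold FAM records into the dict built from the INDI records."""
    current = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "0":
            # A new record starts; only FAM records are of interest here.
            is_fam = len(parts) == 3 and parts[2] == "FAM"
            current = (
                families.setdefault(parts[1], copy.deepcopy(model))
                if is_fam else None
            )
        elif current is not None and len(parts) == 3:
            tag, pointer = parts[1], parts[2]
            if tag in ("HUSB", "WIFE"):
                current[tag] = pointer           # x replaces x, or fills a gap
            elif tag == "CHIL" and pointer not in current["CHIL"]:
                current["CHIL"].append(pointer)  # a list needs this one check
    return families


read_fam_lines(fam_lines, families)
print(families["@F6@"])
```

After the run, @F6@ has gained its HUSB (@I17@) from the FAM record, while @F2@ and @F5@ are unchanged because the FAM lines only repeated what the INDI records had already supplied.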
Post by Uncle Buddy on May 13, 2022 1:16:07 GMT -8
Most of that will work, and the set isn't needed since the default overwriting of redundant keys in a dict does something similar, but there's one small problem.
The problem was that by starting with the FAMC/FAMS tags in the INDI record, there was no way to know if FAMS refers to a HUSB or WIFE. This is just another reason you can't incorporate redundancy in structuring your toy database: if you start down Redundancy Road, you may end up having to check or repeat everything, each time you repeat anything, or else you're not doing redundancy right. And saying FAMS for spouse instead of FAMW for wife and FAMH for husband is a case of arbitrarily drawing the line on how many tags there should be. If you're gonna design a system of parsing lines and using tags to say what the line is about, then there's no disadvantage whatsoever in using as many tags as you need in order to convey all the information you need.
The band-aid solution is to input the spouse implied by FAMS as either the HUSB or the WIFE, whichever is available, and then when reading the FAM tags, if they're backwards, switch them. This is grotesque but better than a complicated system of looking through the record for a SEX tag...
Wait. What if I were to set a variable upon reading a SEX tag (which should be called a GENDER tag, but whatever), then upon reaching the FAMS tag, it can be put into the right place in the dict. This would only work part of the time, because there's no guarantee that the SEX tag will precede the FAMS tag in the record. But I think it's generally done that way so it should help. The check still has to be done in case the SEX tag is missing or doesn't precede the FAMS tag. As regards the design of GEDCOM, informally encouraging a SEX tag to precede a FAMS tag for things to actually work well or flexibly means the system is even more inadequate (once again due to the snares built into redundancy), because SEX and FAMS are both level 1 and there should be no reason to ever care what order two level 1s come in.
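A sketch of that idea, under some loud assumptions: the function name and the `unplaced` list are invented, the SEX lines in the sample are added for illustration (the excerpts above don't show any), and only the values M and F are mapped to a role. Anything else, including a SEX tag that arrives after the FAMS tag, falls through to a list that still has to be reconciled against the FAM records.

```python
def place_spouses(indi_lines):
    """File FAMS pointers under HUSB/WIFE when SEX is already known."""
    families = {}   # FAM pointer -> {"HUSB": ..., "WIFE": ...}
    unplaced = []   # (FAM pointer, INDI pointer) pairs to reconcile later
    curr_id = None
    curr_sex = None
    for line in indi_lines:
        parts = line.split(maxsplit=2)
        if not parts:
            continue
        if parts[0] == "0":
            # New record: forget the previous person's SEX.
            is_indi = len(parts) == 3 and parts[2] == "INDI"
            curr_id = parts[1] if is_indi else None
            curr_sex = None
            continue
        if curr_id is None or parts[0] != "1" or len(parts) < 3:
            continue
        tag, data = parts[1], parts[2].strip()
        if tag == "SEX":
            curr_sex = data
        elif tag == "FAMS":
            role = {"M": "HUSB", "F": "WIFE"}.get(curr_sex)
            if role:
                families.setdefault(data, {})[role] = curr_id
            else:
                unplaced.append((data, curr_id))
    return families, unplaced


sample = [
    "0 @I5@ INDI", "1 NAME Samuel F. /Flood/", "1 SEX M",
    "1 FAMS @F2@", "1 FAMS @F5@",
    "0 @I6@ INDI", "1 NAME Nora Naomi /Mills/",
    "1 FAMS @F2@", "1 SEX F",   # SEX arrives too late to help here
]
families, unplaced = place_spouses(sample)
print(families)
print(unplaced)
```

The second person shows the weakness already noted: her SEX line follows her FAMS line, so she lands in `unplaced` and the later check against the FAM record is still needed.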
Yes, I said "toy" database.
Possibly the better solution is to read FAM tags first instead of INDI tags. I don't know how that would work, but it might be better. Everything else would be the same, except that we'd have a HUSB tag and a WIFE tag to squeeze info from, instead of two FAMS tags. I think that's the right way to go, because it will be faster, but first I have to think about it and make sure that reading FAM tags before INDI tags won't cause even more problems.
One reason GEDCOM is a pain is that it relies on mocking up a hierarchical structure by using level numbers as the first character in each line. This is only a mock hierarchy because you are reading one line at a time and have no information about the lines on either side unless you want to bloat the code with keeping track of that sort of thing. I've managed to read into a record's hierarchy by fiddling around with scary indexing antics, but (partly since I refuse to grow up and write the professional code I'm incapable of writing), the code gets nasty fast.
The code I'm using now is simple and easy to understand, and I want to keep it that way, or we'll be counting on each other's fingers and toes to try and track where we're at and why, through a maze of logic clues. You don't want to try and read another line while you're reading line X. When we say that all the lines between any two consecutive zero lines comprise a single record, we're pretending. I don't know about your text file reading program, but Python reads one line at a time, not a "record" at a time. This is as fun for me as it is for anyone, but...
The only simple way to read the GEDCOM is once through, one line at a time. Doing it this way is a realistic/worthwhile goal because many of the data in the lines can be input to the database on the fly, no looking back. Otherwise I'd go with the first idea I tried, which was going through the file once by converting it to a huge master dictionary of isolated records, then reading through the dictionary to get the input values. The direct-input-whenever-possible strategy is the basis for a more straightforward reading of the file than a clever system would end up becoming, though I'm not sure it's faster. Clever, convoluted code requires more cleverness down the road which begets more convolutions, etc. That's why I'm reading data straight into the database whenever possible and keeping smaller, specialized dictionaries for things that can't be input till their subordinate data is known.
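For what it's worth, a once-through, one-line-at-a-time reader can be sketched like this. It is not Treebard's actual code: the function names are invented, and the only state it keeps is which record it is currently inside (the level 0 pointer and tag), which is enough to know whose FAMC or HUSB a level 1 line belongs to.

```python
def parse_line(line):
    """Split one GEDCOM line into (level, pointer, tag, data)."""
    parts = line.strip().split(maxsplit=2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):
        # "0 @I4@ INDI": the second field is the record's own identifier.
        pointer = parts[1]
        rest = parts[2].split(maxsplit=1) if len(parts) > 2 else [""]
        tag = rest[0]
        data = rest[1] if len(rest) > 1 else ""
    else:
        # "1 FAMC @F2@" or "1 NAME David /Flood/": tag, then its data.
        pointer = None
        tag = parts[1] if len(parts) > 1 else ""
        data = parts[2] if len(parts) > 2 else ""
    return level, pointer, tag, data


def read_gedcom(lines):
    """Walk the lines once, remembering only which record we are inside."""
    curr_id = None
    curr_record = None
    for line in lines:
        if not line.strip():
            continue
        level, pointer, tag, data = parse_line(line)
        if level == 0:
            curr_id, curr_record = pointer, tag
        yield curr_record, curr_id, level, tag, data


sample = [
    "0 @I4@ INDI",
    "1 NAME David /Flood/",
    "1 FAMC @F2@",
    "0 @F2@ FAM",
    "1 HUSB @I5@",
]
rows = list(read_gedcom(sample))
for row in rows:
    print(row)
```

Each yielded tuple can either go straight to the database or into one of the small holding dicts, with no looking back at earlier lines.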
I said that so I could say this. The numbered-line approach was invented by a religion whose scriptures are a collection of numbered lines. It's no wonder that GEDCOM has become some people's dogmatic Truth, the One and Only Way. With those numbered lines, it looks like the bible and gets almost as much attention. By "bible" I mean the Christian Bible, the Torah, the Koran, the Book of Mormon, the Course in Miracles, the Urantia Book, etc. GEDCOM isn't a computer data transfer tool. GEDCOM is scripture. Dogma.
I'm not complaining. Just rallying my strength for the next part of this learning process, and flapping my jaws always makes me feel omniscient and invincible.
Post by Uncle Buddy on May 13, 2022 6:13:57 GMT -8
No, not really. The lines are read in the order they exist, and from what little I've seen so far, INDI records usually come first but shouldn't have to. But if FAM records did come first, then you'd want a collection to keep FAMC/FAMS data in. Maybe just that. And if INDI tags came first, you'd want to collect only the raw data from FAMC/FAMS and insert it to the dict only after the FAM tags have been read. Do you want a different collection for both ends of the stick, or use two sets of keys to the same dict (INDI & FAM identifiers)? (This is again a flaw of using a portrayal of a database hierarchy when a real database could do all the things I've been going on about with its eyes closed. Then instead of programmers having to reinvent the wheel each time they get ready to write code for importing data from yet another export source, they'd follow the same rules they did last time: plain SQL.)

I don't consider this "double-pointed arrow" to be a trivial issue. Here's a discussion on the underlying topic: genealogy.stackexchange.com/questions/12437/gedcom-indi-famc-vs-fam-chil

It really knocks me for a loop to realize that this redundancy is booby-trapped. If you're gonna have links repeated, what purpose could it possibly serve to not repeat them in full? If we could safely ignore one or the other end of the pointed stick, then we'd be home free, but with the GEDCOM exporters likely to do six of one & a half dozen of the other, we're afraid of getting poked in one eye with each end of the stick. We kinda have to always check for family members in both the FAM record and the INDI record even though the two record types are supposed to say the same thing. Well not exactly the same, because FAMS doesn't say whether it's about HUSB or WIFE. So I guess I should be saying they should be supposed to be... never mind.

The GEDCOM-parsing code I've written so far as it stands is so simple that I don't want to add one line to it until I know how I am going to proceed.
Being in a hurry is often the best way to spoil a good thing. So I haven't written one line of code today. Here's the next version:

0 @I4@ INDI
1 NAME David /Flood/
1 FAMC @F2@
0 @I5@ INDI
1 NAME Samuel F. /Flood/
1 FAMS @F2@
1 FAMS @F5@
0 @I6@ INDI
1 NAME Nora Naomi /Mills/
1 FAMC @F7@
1 FAMS @F2@
0 @I15@ INDI
1 NAME _____ /_____/
1 FAMS @F5@
0 @I16@ INDI
1 NAME Annie Mae /Flood/
1 FAMC @F5@
1 FAMS @F6@
...
0 @F2@ FAM
1 HUSB @I5@
1 WIFE @I6@
1 CHIL @I4@
0 @F5@ FAM
1 HUSB @I5@
1 WIFE @I15@
1 CHIL @I16@
0 @F6@ FAM
1 HUSB @I17@
1 WIFE @I16@
self.families = {
    "@F2@": {"CHIL": {"@I4@"}, "FAMS": ["@I5@", "@I6@"], "WIFE": "@I6@", "HUSB": "@I5@"},
    "@F5@": {"FAMS": ["@I5@", "@I15@"], "HUSB": "@I5@", "WIFE": "@I15@", "CHIL": {"@I16@"}},
    # (incomplete dicts omitted; GEDCOM hasn't been read completely)
}

At this point all of the data is stored as presented by the GEDCOM. The value of the CHIL key is a set so that adding the same person ID to it from both the FAM.CHIL and the INDI.FAMC will not cause a duplicate. The nature of the set itself (no repeated elements) eliminates the need to check for duplicates. HUSB and WIFE are just straight out of the FAM record and FAMS values are straight out of the INDI record, no costly questions asked.

The GEDCOM has all been read and the self.families dict is finished. All that's left before the FAM record and its redundantly semi-repeated data can go into the database is to reconcile the two unisex individuals in the FAMS list with the HUSB and WIFE units. How about this pseudocode:

if set(self.families["@F2@"]["FAMS"]) == set([self.families["@F2@"]["WIFE"], self.families["@F2@"]["HUSB"]]):
    pass
else:
    figure_it_out()
I gave this a quick test in the Python console:

>>> sf = {2: {'CHIL': {4}, 'FAMS': [5, 6], 'WIFE': 6, 'HUSB': 5}}
>>> abc = sf[2]
>>> set(abc['FAMS']) == set([abc['WIFE'], abc['HUSB']])
True

If True, the FAMS side of the equation can be discarded and nothing needs to be done. If False, the order of items in the FAMS list isn't the problem. If there are too few items in the FAMS list, but what's there is correct, nothing needs to be done. If len(items) > 2, add the FAMS data to the exception report.
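Those rules could be collected into one small function, roughly as follows. The function name and the four status strings are invented for this sketch, and the "extra" case (FAMS names someone the FAM record doesn't) is an assumption about what `figure_it_out()` would need to catch.

```python
def reconcile(fam):
    """Compare the FAMS list with HUSB/WIFE for one family dict.

    Returns "ok", "incomplete", "too_many" or "extra".
    """
    fams = set(fam.get("FAMS", []))
    spouses = {fam.get("HUSB"), fam.get("WIFE")} - {None}
    if len(fams) > 2:
        return "too_many"    # candidate for the exception report
    if fams == spouses:
        return "ok"          # FAMS adds nothing and can be discarded
    if fams < spouses:
        return "incomplete"  # too few FAMS entries, but none of them wrong
    return "extra"           # FAMS names someone HUSB/WIFE does not


print(reconcile({"CHIL": {4}, "FAMS": [5, 6], "WIFE": 6, "HUSB": 5}))
print(reconcile({"FAMS": [5], "WIFE": 6, "HUSB": 5}))
print(reconcile({"FAMS": [2, 3]}))
```

Because both sides are compared as sets, the order of the FAMS list never matters, which is the point of the console test above.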
|
|
|
Post by Uncle Buddy on May 13, 2022 7:29:31 GMT -8
While parsing the INDI record (`data` in this line is a pointer):
    elif tag in ("FAMS", "FAMC"):
        if data not in self.families:
            self.families[data] = {}
        if tag == "FAMS":
            self.families[data].setdefault("FAMS", []).append(self.curr_id)
        elif tag == "FAMC":
            self.families[data].setdefault("CHIL", set()).add(self.curr_id)
Output:
line 167 self.families: {'@F2@': {'CHIL': {33, 2, 3, 4, 7, 8, 9, 10, 11, 13, 29}, 'FAMS': [5, 6]}, '@F1@': {'FAMS': [2, 3]}, '@F5@': {'FAMS': [5, 15], 'CHIL': {16}}, '@F7@': {'CHIL': {6, 20, 22, 25, 26, 27, 28}, 'FAMS': [18, 24]}, '@F3@': {'FAMS': [11, 12]}, '@F4@': {'FAMS': [13, 14]}, '@F6@': {'FAMS': [16, 17]}, '@F8@': {'FAMS': [20, 21]}, '@F9@': {'FAMS': [22, 23]}, '@F10@': {'FAMS': [29, 30], 'CHIL': {32, 31}}}
The HUSB & WIFE keys are missing from the dict because the FAM records haven't been read yet.
Notice that children are going into the CHIL set from the FAMC line, same place where data from the FAM.CHIL line is stored. Since children aren't differentiated by gender in the FAM record, the FAMC isn't missing anything and they can go straight into the same set. Since it's a set, duplicates are ignored, so we don't have to check for duplicates.
Instead of using a model sub-dict for each FAM pointer, `dict.setdefault()` creates missing keys. This is more efficient, the code is more concise, and the final dict will not have any empty values.
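For anyone who wants to try the idea outside of Treebard, here is a self-contained sketch of the same approach run over a few lines of the sample GEDCOM. It is not the actual import code: `build_families`, `person_id`, and `GED_LINES` are hypothetical names, and the line splitting is deliberately naive (no CONT/CONC handling, no validation).

```python
GED_LINES = """\
0 @I4@ INDI
1 NAME David /Flood/
1 FAMC @F2@
0 @I5@ INDI
1 NAME Samuel F. /Flood/
1 FAMS @F2@
1 FAMS @F5@
0 @I6@ INDI
1 NAME Nora Naomi /Mills/
1 FAMC @F7@
1 FAMS @F2@
0 @F2@ FAM
1 HUSB @I5@
1 WIFE @I6@
1 CHIL @I4@
""".splitlines()


def person_id(pointer):
    """Hypothetical helper: turn a pointer like '@I5@' into the int 5."""
    return int(pointer.strip("@").lstrip("I"))


def build_families(lines):
    families = {}
    curr_id = None    # person ID while inside an INDI record
    curr_fam = None   # family pointer while inside a FAM record
    for line in lines:
        parts = line.split(maxsplit=2)
        if parts[0] == "0":
            # Level-0 lines look like '0 @I4@ INDI' (or '0 TRLR' with no pointer)
            pointer, tag = (parts[1], parts[2]) if len(parts) > 2 else (None, parts[1])
            curr_id = person_id(pointer) if tag == "INDI" else None
            curr_fam = pointer if tag == "FAM" else None
            continue
        tag = parts[1]
        data = parts[2] if len(parts) > 2 else ""
        if curr_id is not None and tag == "FAMS":
            families.setdefault(data, {}).setdefault("FAMS", []).append(curr_id)
        elif curr_id is not None and tag == "FAMC":
            # Children from INDI.FAMC land in the same set as FAM.CHIL
            families.setdefault(data, {}).setdefault("CHIL", set()).add(curr_id)
        elif curr_fam is not None and tag in ("HUSB", "WIFE"):
            families.setdefault(curr_fam, {})[tag] = person_id(data)
        elif curr_fam is not None and tag == "CHIL":
            # Adding an ID that's already in the set changes nothing
            families.setdefault(curr_fam, {}).setdefault("CHIL", set()).add(person_id(data))
    return families


families = build_families(GED_LINES)
print(families["@F2@"])
```

Person 4 reaches the @F2@ CHIL set twice, once from his own FAMC line and once from the FAM record's CHIL line, and the set still holds him only once.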
Most relevant to the title of this thread, persons 2 & 3 were manually added by me to the GEDCOM via FAMC and FAMS but left missing from the FAM record. They appear in self.families as expected, with no need to add them to an exceptions report.
The dict will be used to input all its data to the database after the GEDCOM file is all read.
Do not be offended that Jupiter and Neptune Flood, who appear to be siblings in the above results, appear to be spouses. I added them manually to test this unredundant "bad GEDCOM", but they were not bad people. Let's say that one of them was adopted, and the other didn't live with his biological family after the age of two. So the two of them were not related and didn't meet till they were in their 50s.
And that's the truth.
|
|
|
Post by Uncle Buddy on May 13, 2022 22:20:43 GMT -8
Here is the self.families dict, finally ready to input complete data to the database, except for the exception in which the FAMS tags existed for FAM identifier @F1@, but the corresponding HUSB/WIFE tags didn't exist in the FAM record. I went back to adding an identical blank dict `model` to hold each family's data while reading the .ged file. Using dict.setdefault() no longer worked for me when things got complicated.
line 69 self.families: {'@F2@': {'FAMS': [5, 6], 'HUSB': 5, 'WIFE': 6, 'CHIL': {33, 2, 3, 4, 7, 8, 9, 10, 11, 13, 29}}, '@F1@': {'FAMS': [2, 3], 'HUSB': None, 'WIFE': None, 'CHIL': set()}, '@F5@': {'FAMS': [5, 15], 'HUSB': 5, 'WIFE': 15, 'CHIL': {16}}, '@F7@': {'FAMS': [18, 24], 'HUSB': 18, 'WIFE': 24, 'CHIL': {6, 20, 22, 25, 26, 27, 28}}, '@F3@': {'FAMS': [11, 12], 'HUSB': 12, 'WIFE': 11, 'CHIL': set()}, '@F4@': {'FAMS': [13, 14], 'HUSB': 14, 'WIFE': 13, 'CHIL': set()}, '@F6@': {'FAMS': [16, 17], 'HUSB': 17, 'WIFE': 16, 'CHIL': set()}, '@F8@': {'FAMS': [20, 21], 'HUSB': 21, 'WIFE': 20, 'CHIL': set()}, '@F9@': {'FAMS': [22, 23], 'HUSB': 23, 'WIFE': 22, 'CHIL': set()}, '@F10@': {'FAMS': [29, 30], 'HUSB': 30, 'WIFE': 29, 'CHIL': {32, 31}}}
line 127 case not handled ident: @F1@
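A blank-template approach like the one described might look something like this. It is a sketch, not the actual Treebard code: the names `MODEL` and `get_family` are hypothetical, and the template's shape is only inferred from the printed dict above.

```python
import copy

# Assumed template: every family starts with all four keys present
MODEL = {"FAMS": [], "HUSB": None, "WIFE": None, "CHIL": set()}


def get_family(families, pointer):
    """Return the dict for `pointer`, creating it from the template if missing."""
    if pointer not in families:
        # deepcopy so no two families share the same list or set object
        families[pointer] = copy.deepcopy(MODEL)
    return families[pointer]


families = {}
get_family(families, "@F1@")["FAMS"].extend([2, 3])   # FAMS seen, no FAM record
get_family(families, "@F3@")["HUSB"] = 12
print(families)
```

The `copy.deepcopy` call matters here: a shallow copy such as `dict(MODEL)` would leave every family sharing one FAMS list and one CHIL set, so an ID added to one family would show up in all of them.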
The code is still simple and straightforward, and the gambit for dealing with improperly redundacized family tags seems to work. It's become apparent by now that a spouse mentioned only in FAMS will have to go into an exception report because it's not Treebard's place to delineate whether a spouse is a husband or a wife. The only problem now is that the redundant double-pointed arrow exists at all. If the GEDCOM standard would call for a FAMW and a FAMH tag instead of two ambiguous FAMS tags in the INDI record, at least this redundancy problem would have been defined consistently. It's possible that GEDCOM 7 has already tried to correct this problem [...NOT!...], but even if it did, it doesn't do anything about the jillions of family trees out there that are being used in apps that don't support GEDCOM 7.
The world of genealogy might be expected to deal with multiple versions of GEDCOM, but with that kind of work required to simply import a bunch of data, we'd be way ahead of the game to ignore future tweaks to the GEDCOM "standard" and go straight to the real standard for storing related data, and that is SQL. If we're gonna have to do all our work over for some new standard, then let's skip the next 30 years of incremental and superficial tweaks--tweaks which will mostly complicate the GEDCOM structure--and define a new standard right now in SQLite.
The amount of what should be unnecessary work required by the redundancy of the FAM/FAMS/FAMC arrangement in GEDCOM isn't the end of the world. I've been happy to have GEDCOM to use over the years. But expecting all genieware developers to be GEDCOM savants is unrealistic. Let's get with the 21st century. GEDCOM is zombie software from the dawn of the personal computer age. Talented genealogists who love and understand GEDCOM should be using their considerable ability and single-minded focus to replace GEDCOM with software that was created specifically for defining relationships among data. Please stop fixing GEDCOM and let the poor thing take its rightful place in software heaven. We have to choose between fixing it and letting it die of natural causes. Keeping it limping and staggering along is a cruel punishment for a tool that has served us so long and so... sort of OK.
Still expecting genealogists to use GEDCOM in 2022 is like expecting us to use a 38-year-old computer to compile our work. Theoretically possible, but not recommended. I was late getting into computers: mid 1990s. My first scanner cost $1000. My first hard drive had 750 MB of storage space. A floppy disc wouldn't hold one picture. My CD drive cost $300 and my tape drive was such a pain I only used it once. Do we really want to use the genieware file transfer tool that's left over from those Wild West days?
If we have to recreate all our family trees from scratch in order to replace GEDCOM properly, then I for one am happy to do it. Ecstatic to do it. For most of us, this is a hobby. Why should hobbyists be stuck in the last century in support of the interests of professional genealogists (who can't redo all their work if GEDCOM is replaced)? Do professional genealogists work for us for free? Well yeah, if they feel like it. Do we work free for them? Every time we spend days sifting through badly imported data so the world can keep using GEDCOM, we are working free for whoever happens to be motivated, for whatever reason, to keep competent standardization of file transfers beyond arm's length, a carrot on a stick.
The slogan at Treebard University, now that we are actually learning how to import GEDCOM, is still "Forget GEDCOM, just share the whole tree."
|
|