|
Post by Uncle Buddy on May 21, 2022 1:13:20 GMT -8
With the recent simplification and improvement of how I go about reading subordinate lines in GEDCOM, I was disappointed to find that it didn't work right for the PAGE tag, which is about citations. I shouldn't have expected it to work right; the reason I did is that for the subordinate NAME tag, it apparently did work right. I was able to make the needed primary key for each NAME tag even though the tags occur on subordinate lines.
The reason my code worked for NAME tags but not PAGE tags was somewhat incidental. There was only one name given for each individual, so only one primary key was created for each person in the database name table. That's the simplified version. I will have to add an a.k.a. for one of the people in the GEDCOM file I've been using, to see what happens in that case. But in comparison, the citations I'd added to the family tree represented by the .ged file were copied & pasted over and over. In an obituary there might be several names found, and the same citation would be linked to each of these names. So for every PAGE line repeating a given citation, the citation text was re-inserted to a new record in the citation table, with a new primary key, and this new record was linked to the name in the links_links many-to-many junction table.
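To make the many-to-many idea concrete, here's a minimal sketch of the junction-table shape being described, using sqlite3 from Python's standard library. The table and column names are simplified stand-ins suggested by the thread, not Treebard's actual schema: one citation row, stored once, linked to two names through the junction table instead of being re-inserted for each name.

```python
import sqlite3

# Minimal sketch of the many-to-many shape described above; table and
# column names are simplified stand-ins for the real schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        source_id INTEGER,
        citations TEXT);
    CREATE TABLE name (
        name_id INTEGER PRIMARY KEY,
        names TEXT);
    CREATE TABLE links_links (          -- the junction table
        links_links_id INTEGER PRIMARY KEY,
        name_id INTEGER REFERENCES name,
        citation_id INTEGER REFERENCES citation);
""")
# One citation, stored once...
cur.execute("INSERT INTO citation VALUES (167, 2, 'Samuel F. Todd obituary')")
# ...linked to two different names through the junction table.
cur.execute("INSERT INTO name VALUES (2688, 'David Todd')")
cur.execute("INSERT INTO name VALUES (2689, 'Samuel D. Todd')")
cur.execute("INSERT INTO links_links (name_id, citation_id) VALUES (2688, 167)")
cur.execute("INSERT INTO links_links (name_id, citation_id) VALUES (2689, 167)")

rows = cur.execute("""
    SELECT name.names FROM name
    JOIN links_links USING (name_id)
    WHERE links_links.citation_id = 167""").fetchall()
print(rows)  # both names come back from the single citation row
```

The point of the junction table is that deleting or editing the citation text happens in exactly one row, no matter how many names it's linked to.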
So it's not really that the code isn't working. It's working great. The problem is that GEDCOM is once again breaking a cardinal rule of programming: don't repeat yourself. But whose fault is that really? The program I used to input the data originally, Family Historian, doesn't use a database. It just stores its data directly into a GEDCOM file. That's a cute idea. If GEDCOM weren't broken before it left the factory, it might be a good idea, although using a text file instead of SQL will never be a great idea. But as it stands, with GEDCOM's having been born and raised to represent oversimplified data, I'd have to say that it's one of those cute-but-useless ideas. Nothing but trouble.
Family Historian claims to have a couple of innovative systems for making it easier to save citations repeatedly. This is a selling point, because some citations do need to be linked over and over to different elements. But it turns out that the data structure being used is not innovative, just the GUI. With each repetition of the same citation going into a subordinate line, well, to put it simply, GEDCOM doesn't do many-to-many relationships. If it does, then the means of getting it to do this are neither obvious nor well known to the various amateur and expert programmers who are trying hard to create genieware, but hobbled by the need to import and export data. Under the circumstances, with everyone out there repeating the nonsense phrase of the week: "The GEDCOM standard..." well it's no wonder everyone is in a quandary about how to proceed. We should pick our standards more carefully.
The next direction this rant takes could be, "Don't fix broken software, replace it..." but I want to try and sniff out practical solutions instead of throwing the baby out with the bathwater. This doesn't mean I'm about to suggest upgrading to GEDCOM 7. GEDCOM or its replacement needs to work right, first time, every time. With GEDCOM's creators stating that it should not be changed too quickly, the task of replacing it is, or should be, out of their hands. It will or should fall into the hands of parties willing to place more importance on the future of genealogy than on the body of our past work. Future genealogists will have JUST AS MUCH FUN figuring out who our ancestors were as we did. Especially if they don't have to fiddle around with GEDCOM.
Computerized genealogy amounts to a few decades of work that might have to be partially done over when we do replace GEDCOM. Compare that hardship to the vastness which is the future of genealogy. The longer we give our power to a stubborn unwillingness to redo our work, the worse the problem will get. We already redo a lot of our work anyway, when importing, due to GEDCOM's deficiencies. If I'm not willing to do my family tree over in order to get it into a properly saved and exportable format, then someone else will be ecstatic to do it for me. Genealogy is fun. Those who have made genealogy their job or obligation, well sorry, but maybe the open source community is gonna do what needs to be done, with or without the assistance of the programs which are dying as we speak... because of GEDCOM. Maybe it is the king-of-the-mountain blind obstinacy of the profit motive that is keeping this thing festering. When enough genieware providers break loose of the two pesos profit they stand to earn for their trouble and get serious about making good software, a revolutionary consortium of relatively altruistic data-mongers is gonna spring up out of the foaming froth of discontent and replace GEDCOM without anybody's permission.
But that's not what this thread is about.
I mean to discuss pragmatic solutions to this wriggling, wiggling, writhing tangle of spaghetti-esque GEDCOM anti-patterns.
So about those citations. I see two possible approaches. Someone else might see plenty more.
1) Write a bunch of code to treat PAGE tags as proper primary-key, i.e. zero-level, tags by making a single database record for each unique citation. A duplicate citation would be a citation with identical text linked to the same source. So "line 62" linked to "source 16" would be detected as occurring over and over in the GEDCOM. A single citation_id would be created for it as a primary key in the citation table. Then for each recurrence, the citation_id column in links_links would be populated with that key, and some other column in the same record would get a foreign key for whatever element is being linked to the citation, such as name_id. To get this ball rolling, graze the .ged file for PAGE tags, put them in a list, and the path will reveal itself from there.
2) All of the above, but instead of putting the re-arranged data directly into the database, make an example of how GEDCOM should have been written, for future reference. This option could be done one of two ways:
a) Rewrite the GEDCOM and import the rewritten GEDCOM to the genieware, or b) Translate the GEDCOM into a completely different format such as DATABOY and import the DATABOY to the genieware.
Since GEDCOM is inherently deficient, I think it would be silly to rewrite an existing GEDCOM file as a modified GEDCOM file. So I'm going with option 2b. It's maybe the hardest option, and probably the right one.
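Approach 1 above could start out something like the following sketch. The parsing and the names (`collect_citations`, the list shapes) are illustrative guesses, not the actual import code: one citation_id is handed out per unique (source, citation text) pair, and every recurrence just records another link.

```python
# Sketch of approach 1: deduplicate PAGE citations by (source, text).
# Names and structures here are illustrative, not the real import code.
def collect_citations(gedcom_lines):
    """Assign one citation_id per unique (source xref, citation text) pair."""
    citation_ids = {}        # (source_xref, page_text) -> citation_id
    links = []               # one (citation_id, source_xref) per occurrence
    current_source = None
    next_id = 1
    for line in gedcom_lines:
        level, _, rest = line.partition(" ")
        tag, _, value = rest.partition(" ")
        if tag == "SOUR" and level != "0":
            current_source = value            # e.g. '@S2@'
        elif tag == "PAGE":
            key = (current_source, value)
            if key not in citation_ids:       # first occurrence: new PK
                citation_ids[key] = next_id
                next_id += 1
            links.append((citation_ids[key], current_source))
    return citation_ids, links

sample = [
    "1 SOUR @S2@",
    "2 PAGE 10 September 1962, page 2, Samuel F. Todd obituary",
    "1 SOUR @S2@",
    "2 PAGE 10 September 1962, page 2, Samuel F. Todd obituary",
    "1 SOUR @S2@",
    "2 PAGE S. F. Todd death notice",
]
ids, links = collect_citations(sample)
print(len(ids))    # 2 unique citations get primary keys
print(len(links))  # 3 occurrences to record in links_links
```

The `links` list would later be joined with whatever element (name_id, finding_id, etc.) each citation was subordinate to.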
|
|
|
Post by Uncle Buddy on May 21, 2022 1:14:42 GMT -8
Getting back to the comparison: names worked right, kinda unexpectedly, while citations only worked "right" by GEDCOM's "let's just lower the bar" standards. I input the last three lines below to the GEDCOM file manually:

0 @I4@ INDI
1 NAME David /Todd/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
1 NAME Samuel D. /Todd/
2 SOUR @S2@
3 PAGE S. F. Todd death notice

The results from both the name and citation table appear correct:

sqlite> select * from name;
name_id|person_id|names|name_type_id|sort_order|used_by
...
2688|4|David Todd|1|Todd, David|
2689|4|Samuel D. Todd|1|Todd, Samuel D.|
...
sqlite> select * from links_links where name_id in (2688, 2689);
links_links_id|person_id|places_places_id|name_id|source_id|citation_id|finding_id|claim_id|project_id|to_do_id|contact_id|image_id|note_id|role_id|repository_id|report_id|INTEGER|chart_id|media_id
167|||2688||167|||||||||||||
168|||2689||168|||||||||||||
sqlite> select * from citation where citation_id in (167, 168);
citation_id|source_id|citations
167|2|10 September 1962, page 2, Samuel F. Todd obituary
168|2|S. F. Todd death notice

The only reason the citation results aren't repeated like the other citations is that the citation wasn't input more than once. Why is this problem not popping up at all when I make a name_id primary key, which I'm also doing in the code which reads subordinate lines?

This goes back to the founding principles of DATABOY's parent project, the DATABOY Refactorium Conspiracy. Persons have a one-to-many relationship with names: a person can have many names, but each of his names is his-only (even if it's spelled the same as someone else's name). But names have a many-to-many relationship with citations. Each citation can refer to many names, and each name can be mentioned by many citations. I'm no data specialist, not even close, but this smells like maybe the reason that the code which correctly handled names didn't know what to do with citations.

Code is stupid. Not as stupid as text, but it's still stupid. It has to be written correctly or it won't do what you expect. In this case, GEDCOM has not bothered to say, "Here's some citation text which is found in source X and will be repeated over and over." I just gave it a unique ID and walked away. What's needed is more code to deal with many-to-many GEDCOM tags, which GEDCOM doesn't recognize as Elements of Genealogy that therefore require primary keys of their own. I got away easy with names because a name is not going to be repeated from person to person. If names could be shared like that, I'd be having the same problem with them, but they can't.

Main point: DATABOY is based on the principles of relational database design. It just occurred to me that I don't know the name of the chapter in the data theory book which deals with this topic. Hang on while I google it.
OK, the word is "cardinality", but the term is used two different ways with regard to databases, so here's a quick answer from Oded at StackOverflow: if we're talking about whether a pair of data have a one-to-one, one-to-many, or many-to-many relationship, we're talking about that data's cardinality. The important question is still: what am I going to do about it? Based on this much information, it seems there are two ways to go:

1. Detect all subordinate tags in the GEDCOM file which should be used to create primary key records in the database, and use them to do so.

2. Detect only subordinate tags which not only should be used to create primary key records, but also represent many-to-many data relationships.

I favor the former approach as it involves less fiddling around, less of that expensive splitting of the hair. What keeps nagging at me is that the means I've come up with for saving associated data in preceding lines won't be getting that data in time to do anything about it. But that situation already exists anyway. For example, when creating a person in the person table from a line like `0 @i4@ INDI`, an insert query makes the record, but the record isn't yet complete. A subordinate line might come up giving the person's gender, so an update query will have to go back and add "male" to the gender column in the new person table row.

So, if citations are created when all we have is the citation text, then the new citation_id is saved in the "most recent n - 1" way in the self.instrux list. When the other data shows up (name_id and source_id), if the self.instrux method still works, the citation row can be updated with the related source_id, and a new row can be created in links_links to relate the name and the citation. Citations can be linked to lots of things, not just names. Just thought I'd mention it. We've only just begun.
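The insert-now, update-later pattern described above is easy to sketch with sqlite3; the schema here is a simplified stand-in, not Treebard's actual person table.

```python
import sqlite3

# Sketch of the insert-then-update pattern: the person row is created as
# soon as the zero line is seen, then filled in as subordinate lines
# arrive. Schema is a simplified stand-in for the real one.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, gender TEXT)")

# `0 @I4@ INDI` -- insert an incomplete record, keep the new PK handy
person_id = 4
cur.execute("INSERT INTO person (person_id) VALUES (?)", (person_id,))

# later, `1 SEX M` shows up -- go back and complete the record
cur.execute("UPDATE person SET gender = ? WHERE person_id = ?",
            ("male", person_id))

print(cur.execute("SELECT gender FROM person WHERE person_id = ?",
                  (person_id,)).fetchone()[0])  # male
```

The same two-step dance works for any element whose zero line arrives before its details do.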
|
|
|
Post by Uncle Buddy on May 21, 2022 4:28:51 GMT -8
Take a look at these GEDCOM lines:
0 @I4@ INDI
1 NAME David /Todd/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
1 NAME Samuel D. /Todd/
2 SOUR @S2@
3 PAGE S. F. Todd death notice
1 SEX M
1 RESI
2 DATE 1962
2 PLAC Meridian, Lauderdale County, Mississippi, USA
...
0 @S2@ SOUR
1 TITL Biloxi Daily Herald, Gulfport-Biloxi-Mississippi Coast
The import module first finds all the zero tags and creates primary keys for them. Later, while reading the list of lists made from all the GEDCOM lines converted to sublists, for each line that doesn't start with a zero, a method `build_on_branch()` is run which so far has proved able to do something with the tags NAME, SOUR and PAGE.
The discussion of NAME and PAGE is above in this thread. The SOUR tag occurs in subordinate lines because its value is a foreign key. A correct GEDCOM form would have already used NAME and PAGE in zero-level lines to create primary keys, and the key could be used here as a foreign key to avoid repeating text for multiple-use citations, but in GEDCOM-as-is, these zero lines don't exist.
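The xref-to-primary-key lookup implied here might look like this; the names are hypothetical, and the mapping would be built during the zero-line pass that creates primary keys.

```python
# Hypothetical sketch: the zero-line pass has already mapped each
# cross-reference identifier (e.g. '@S2@') to a database primary key.
xref_to_pk = {"@S2@": 2}   # built while creating PKs from zero lines

def resolve_sour(line):
    """Turn a line like '2 SOUR @S2@' into the source table's primary key."""
    level, tag, value = line.split(" ", 2)
    assert tag == "SOUR"
    return xref_to_pk[value]

print(resolve_sour("2 SOUR @S2@"))  # 2
```

So a subordinate SOUR line never needs its own new primary key; it only needs the lookup.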
What occurs to me, for a pattern like the one in the .ged lines above, is a procedure like this:
Make a list of missing_elements which would include NAME and PAGE and other tags which should be zero-level tags. Loop through the list of lines and whenever a missing_element is encountered, put it in a list with its information. Loop through the list of missing_element instances in a preliminary step which, like the current `insert_primary_key()` method, gets PKs made in advance so they'll be there when needed.
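The preliminary pass described above might be sketched like this. Everything here is a guess at shape, not the actual `insert_primary_key()` code: NAME lines always get a fresh PK (a name belongs to one person only), while identical PAGE citations share one PK.

```python
# Sketch of the preliminary pass: graze the file for tags that *should*
# have been zero-level elements and hand out primary keys in advance.
# Names and structures are illustrative stand-ins.
def premake_primary_keys(gedcom_lines):
    name_pks = []        # one new PK per NAME line (names are never shared)
    page_pks = {}        # citation text -> PK (identical citations share one)
    next_pk = 1
    for line in gedcom_lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        if parts[1] == "NAME":
            name_pks.append((next_pk, parts[2]))
            next_pk += 1
        elif parts[1] == "PAGE" and parts[2] not in page_pks:
            page_pks[parts[2]] = next_pk
            next_pk += 1
    return name_pks, page_pks

lines = [
    "1 NAME David /Todd/",
    "1 NAME David /Todd/",              # same text, but still a new name PK
    "3 PAGE S. F. Todd death notice",
    "3 PAGE S. F. Todd death notice",   # same citation: the PK is reused
]
name_pks, page_pks = premake_primary_keys(lines)
print(len(name_pks), len(page_pks))  # 2 1
```

With the PKs made in advance, the main parse can look them up instead of inventing duplicates.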
For NAME tags, the data is the name text, and its cardinality with the primary key is 1:1, so the text and the key go in the same line. Since it has no zero line, and thus no corresponding sublist in self.instrux, its value goes in its corresponding most recent n - 1 index in self.instrux, which in the examples above is the index corresponding to the INDI zero lines. Later when parsing subordinate lines, the line `1 NAME David /Todd/` will be ignored except to add its foreign key value to the corresponding index of self.instrux. The line `2 SOUR @s2@` will get this value and carry it forward till it gets to a line that can use it.
For PAGE tags, a new record is made in the citation table for only the first instance of each unique citation, and now we have a primary key for this citation and this source. The PK has no line in the GEDCOM, thus no line in self.instrux, so it has to be stored in the self.instrux index that corresponds to the line that the PAGE tag line goes to for its needed values. In the example, the PAGE line is a three-line, so when this line is parsed, attention goes to the most recent two-line, the line just above it. From self.instrux it gets the name_id and the source_id. The citation text is ignored at this point, it's already in the citation table on the same row with the PK that has to be linked to the source_id in the same row, and to the name_id in a new row of links_links.
The data in the SEX tag will just go into the person table with an update query. The right PK to look for is in self.instrux waiting for it.
The RESI line has no data yet except itself. Its n - 1 line is the person_id PK zero line. In Treebard, an imported event is input as a conclusion, which in Treebard code is called a finding. There is a finding table, and each finding (event) has a primary key. So this subordinate line also gets the treatment outlined above for NAME and PAGE tags. The tag "RESI" tells us which event_type_id to input as a foreign key while creating the residence finding. The primary key for the finding has no corresponding index, so its value goes into the self.instrux index corresponding to the most recent n - 1 line, which is the INDI zero line. Then while parsing subordinate lines, the stored finding_id is passed on to the date line below it, so the finding table row can be updated with a date. The date will first be converted to the format that Treebard uses to store dates.
The PLACE tag on the next line will also get the value of the finding_id from the appropriate index of self.instrux and make another trip to update the same line of the finding table with a place. But here we run into another of GEDCOM's over-simplifications. The whole nested place, which is really four places, has been given as a single string. To use such simple and repetitive place representations in Treebard, the string will have to be split at the commas and turned into a list. No problem, it's easy to do.

Then, to prevent the import process from being interrupted to ask the user for input, each of these places should ideally go straight into the database, no questions asked. The result will be multiple duplicate places that will all have to be merged into one place manually when the import process is over. This will have to be done before the new imported tree can be used. But many users will ignore that step and just use all the duplicate places as-is. This amounts to letting GEDCOM ruin Treebard, which is unacceptable. I'll have to think about how to deal with this. Interrupting the import process for user input might be the only right way to handle this. Treebard is not pro-R-U-Sure dialogs, but makes an exception with new and duplicate place inputs, because I think that getting places right the first time is worth the trouble, to avoid having to merge places later.
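The comma-splitting step is the easy part, as a quick sketch shows (the function name is mine, not Treebard's):

```python
# Sketch: splitting GEDCOM's single nested-place string into the list of
# individual places that a relational design needs. Illustrative only.
def split_place(plac_value):
    """'Meridian, Lauderdale County, Mississippi, USA' -> list of places,
    smallest first, exactly as nested in the GEDCOM string."""
    return [part.strip() for part in plac_value.split(",")]

places = split_place("Meridian, Lauderdale County, Mississippi, USA")
print(places)  # ['Meridian', 'Lauderdale County', 'Mississippi', 'USA']
```

The hard part is everything after the split: deciding whether each of those four places already exists in the database.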
|
|
|
Post by Uncle Buddy on May 21, 2022 22:40:22 GMT -8
I watched a funny and scarily accurate video this morning over breakfast called 10 Programmer Stereotypes, in which someone finally came out and said outright that the stereotype of the hacker who sits down and, without any planning, types non-stop till his nearly impossible work is finished, not stopping to test, correct errors, research, make notes, or threaten to jump out the window, is 100% false. If you don't know what stereotype I'm talking about, some good examples include the beer scene in the first part of The Social Network where "Mark Zuckerberg" creates Facebook while getting drunk in his dorm room, or just about any episode of Mr. Robot or any other hacker flick where programmers are portrayed as--well... let's say they're portrayed as someone who's not going to waste their audience's precious time by showing them what writing code is really like.

Do not dismay, boys and girls: real programmers who code for hours at top speed, with nary a glance away from the Great Wall of Code, despite being intoxicated and undernourished, are few and far between. But Hollywood needs its gross exaggerations, and both the shows mentioned above are entertaining.

In contrast, this morning when I started staring at my screen, as usual it took a good part of whatever strength of character I might possess to make myself think hard about the code I've written in the past few days. What little there is of it. New ideas sharpen the focus while they're new, but understanding how they worked when the focus has gone back to wherever focus lives... Well, as usual, the answer to my quandary came when I walked away from the computer and forced myself to reconstruct in my mind what the code was actually doing. At times like this, staying glued to a chair staring at the screen is the most distracting waste of time there is, not to mention bad for the eyes and the backside. Escaping the chair should be accomplished as often as possible.
In my labored reverie, I kept drifting back to something I said three times in yesterday's final post, something to the effect of, "The zero lines for these subordinate lines being converted to primary key lines do not exist, so the values they need to store can't be stored in self.instrux..." and maybe that's where I should have stopped. What I wrote to complete those sentences was highly speculative.

The "most recent n - 1" strategy is simple and works flawlessly where the lines being referenced do actually exist, but it's just enough of a mental somersault to visualize what it's doing that trying to twist it into an all-purpose tool, for cases where its inborn abilities don't quite measure up, is probably not worth it. Not when we have good old-fashioned instance variables to rely on. Simple and straightforward. I've said somewhere up above that this solution to references left over by prior lines of GEDCOM keeps escaping me. Could it be that the whole "most recent n - 1" gambit is just a clever trick being used for entertainment, whereas instance variables could get the job done without obscuring what the job is?

Before starting this post, I already got an instance variable to input name_ids to the right dicts in the self.export list, without the mind salad caused tomorrow by the insidious cleverness of today. So presumably, instance variables should be used instead of the "most recent n - 1" gambit for the subordinate tags that should have been zero tags. But the question is, could instance variables also replace the gambit entirely? If so, then I've dropped the ball over and over, each time I asked myself what I should be doing to reference prior lines and forgot to just use simple instance variables. Let's try to preliminarily test the hypothesis that instance variables would suffice, without writing any code.
Whilst I must love my precious "most recent n - 1" gambit because it was born with my DNA tattooed on its forehead, if there's a better way, a way that's simpler and more universally applicable, then why build solutions based on how GEDCOM's wretched n system works? For each of the sample GEDCOM lines below, the instance variable assignment is shown, then the resulting sublist in self.export ([{}, {}] appended or changed). Each sublist of self.export contains only two items: the primary key being created (if any), and the data, i.e. all available values including foreign keys, that will be input to a common line of the database. The second dict is nested, with its first-level keys being names of database tables and its second-level keys being names of database columns. Assume we've already gone through and added actual zero-line data before looking at subordinate lines.
GEDCOM line:  0 @I4@ INDI
assignment:   self.person_id = 4
self.export:  [{'person_id': self.person_id}, {}]

GEDCOM line:  1 NAME David /Todd/
assignment:   self.name_id = 8
self.export:  [{'name_id': self.name_id}, {'name': {'names': 'David Todd', 'sort_order': 'Todd, David', 'person_id': self.person_id}}]

GEDCOM line:  2 SOUR @S2@
assignment:   self.source_id = 2
self.export:  (nothing)

GEDCOM line:  3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
assignment:   self.citation_id = 23
self.export:  [{'citation_id': self.citation_id}, {'links_links': {'citation_id': self.citation_id, 'name_id': self.name_id}, 'citation': {'source_id': self.source_id, 'citations': '10 September 1962...'}}]

GEDCOM line:  1 NAME Samuel D. /Todd/
assignment:   self.name_id = 343
self.export:  [{'name_id': self.name_id}, {'name': {'names': 'Samuel D. Todd', 'sort_order': 'Todd, Samuel D.', 'person_id': self.person_id}}]

GEDCOM line:  2 SOUR @S2@
assignment:   self.source_id = 2
self.export:  (nothing)

GEDCOM line:  3 PAGE S. F. Todd death notice
assignment:   self.citation_id = 24
self.export:  [{'citation_id': self.citation_id}, {'links_links': {'citation_id': self.citation_id, 'name_id': self.name_id}, 'citation': {'source_id': self.source_id, 'citations': 'S. F. Todd...'}}]

GEDCOM line:  1 SEX M
assignment:   (nothing)
self.export:  [{'person_id': self.person_id}, {'person': {'gender': 'male'}}]

...

GEDCOM line:  0 @I5@ INDI
assignment:   self.person_id = 5
self.export:  [{'person_id': self.person_id}, {}]

GEDCOM line:  1 NAME Samuel F. /Todd/
assignment:   self.name_id = 56
self.export:  [{'name_id': self.name_id}, {'name': {'names': 'Samuel F. Todd', 'sort_order': 'Todd, Samuel F.', 'person_id': self.person_id}}]

GEDCOM line:  2 SOUR @S2@
assignment:   self.source_id = 2
self.export:  (nothing)

GEDCOM line:  3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
assignment:   self.citation_id = 23 (same citation and same source as before, so the existing PK is reused)
self.export:  [{'citation_id': self.citation_id}, {'links_links': {'citation_id': self.citation_id, 'name_id': self.name_id}}]

GEDCOM line:  1 BIRT
assignment:   self.event_type_id = 1; self.finding_id = get finding_id from finding table where person = self.person_id and event_type_id = self.event_type_id
self.export:  [{'finding_id': self.finding_id}, {'finding': {'event_type_id': self.event_type_id}}]

GEDCOM line:  2 DATE 27 JAN 1882
assignment:   self.finding_id = 699
self.export:  [{'finding_id': self.finding_id}, {'finding': {'date': '-1882-01-27-------', 'date_sorter': '1882,1,27'}}]

GEDCOM line:  2 PLAC Butler, Choctaw County, Alabama
assignment:   self.finding_id = 699
self.export:  [{'finding_id': self.finding_id}, {'place': {it's complicated}, 'places_places': {it's complicated}}]

GEDCOM line:  2 SOUR @S2@
assignment:   self.source_id = 2
self.export:  (nothing)

GEDCOM line:  3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary
assignment:   self.citation_id = 23 (reused again)
self.export:  [{'citation_id': self.citation_id}, {'links_links': {'citation_id': self.citation_id, 'finding_id': self.finding_id}}]

GEDCOM line:  1 RESI
assignment:   self.event_type_id = 13; self.finding_id = 66
self.export:  [{'finding_id': self.finding_id}, {'finding': {'event_type_id': self.event_type_id}}]

GEDCOM line:  2 DATE FROM 1959 TO 1962
assignment:   (nothing)
self.export:  [{'finding_id': self.finding_id}, {'finding': {'date': '-1959----to--1962---', 'date_sorter': '1959,0,0'}}]

GEDCOM line:  2 PLAC 544 Camp Avenue, Gulfport, Harrison County, Mississippi
assignment:   it's complicated

And there you go. A sort-of complete self.export ready to put data straight into a Treebard database, without any stupid pet tricks. If you don't know what stupid pet tricks are, then you need to research that for yourself.

The GEDCOM substandard has played us a rotten hand with its astronomically inadequate version of what a place is and how different places relate to each other. GEDCOM's notion of one string for a whole nest of places will require some thought, but no compromises on my part. I'm sure I can get the right stuff into Treebard's database from a GEDCOM import, but it won't be easy, and the user will have to get involved to prevent duplication. This is because the right way to differentiate Paris, France from Paris, Texas is not joining them into long strings and comparing the long strings, because (in some better example) Texas and France might be the same place depending on the era in question. In which case, the Texas being discussed should have one single place_id. There are other reasons besides that particular edge case. The only right way to deal with single places "Paris" and "Paris" is to ask the user to decide whether the new "Paris" being input is the existing Paris, Texas; or the existing Paris, France; or some new Paris, in which case it needs a new primary key.

Now I have to write the code, to find out what all I've overlooked regarding using instance variables only to roll with GEDCOM's eccentricities. It seems possible that I might not have to loop twice over the lines just so primary keys can go in, but I can't tell at this point. Anyway, this is a good time to start the class again completely from scratch, to find out how easy this can possibly get.
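As a starting point, the instance-variable approach tabulated above might skeletonize roughly like this. The class name, the PK numbering, and the parsing details are all my guesses for illustration, not Treebard's real import code, and citation deduplication is omitted to keep the sketch short.

```python
# Rough skeleton of the instance-variable approach tabulated above.
# Class, attribute, and table names follow the table; everything else
# is a simplified stand-in for the real import code.
class GedcomImporter:
    def __init__(self):
        self.person_id = None
        self.name_id = None
        self.source_id = None
        self.citation_id = None
        self.export = []          # list of [pk_dict, data_dict] sublists
        self._next_pk = 1

    def _new_pk(self):
        pk, self._next_pk = self._next_pk, self._next_pk + 1
        return pk

    def read_line(self, line):
        level, tag, value = (line.split(" ", 2) + [""])[:3]
        if tag.startswith("@") and value == "INDI":   # e.g. `0 @I4@ INDI`
            self.person_id = int(tag.strip("@").lstrip("I"))
            self.export.append([{"person_id": self.person_id}, {}])
        elif tag == "NAME":
            self.name_id = self._new_pk()
            self.export.append([{"name_id": self.name_id},
                                {"name": {"names": value.replace("/", ""),
                                          "person_id": self.person_id}}])
        elif tag == "SOUR":                           # e.g. `2 SOUR @S2@`
            self.source_id = int(value.strip("@").lstrip("S"))
        elif tag == "PAGE":
            self.citation_id = self._new_pk()
            self.export.append([{"citation_id": self.citation_id},
                                {"links_links": {"citation_id": self.citation_id,
                                                 "name_id": self.name_id},
                                 "citation": {"source_id": self.source_id,
                                              "citations": value}}])

imp = GedcomImporter()
for line in ["0 @I4@ INDI",
             "1 NAME David /Todd/",
             "2 SOUR @S2@",
             "3 PAGE S. F. Todd death notice"]:
    imp.read_line(line)
print(len(imp.export))  # 3 sublists: person, name, citation
```

Each instance variable simply holds the most recently seen value of its kind, which is exactly what the subordinate lines below it need; no "most recent n - 1" arithmetic required.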
|
|