Link Source to What?

Post by Uncle Buddy on May 14, 2022 6:20:29 GMT -8

0 @I5@ INDI
1 NAME Samuel F. /Flood/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Flood obituary

The above scrap of GEDCOM looks innocent enough, and it is. But it could easily lead to one of computer genealogy's more common mistakes: linking sources to conclusions.

First a little background.

I saw that it was almost time to extract the sources that GEDCOM places subordinate to several tags in the INDI record. Previously, I had only linked level-1 SOUR tags directly to INDI records, but now I'm getting ready to work on linking level-2 SOUR tags to various level-1 tags including NAME, OCCU, RELI, RESI, BURI, BIRT, DEAT, and EVEN.
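Here's a rough sketch of that reading step, just to make it concrete. It isn't Treebard's actual import code: the names `parse_line`, `find_subordinate_sources` and `SOURCED_TAGS` are made up for this example, and it only handles the simple line shapes shown in this post.

```python
# Sketch only: find level-2 SOUR tags and the level-1 tag each one sits under.
# All function and variable names are hypothetical, not Treebard's code.

SOURCED_TAGS = {"NAME", "OCCU", "RELI", "RESI", "BURI", "BIRT", "DEAT", "EVEN"}

def parse_line(line):
    """Split one GEDCOM line into (level, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    # Level-0 records look like '0 @I5@ INDI': the xref comes before the tag.
    if level == 0 and len(parts) == 3 and parts[1].startswith("@"):
        return level, parts[2], parts[1]
    tag = parts[1]
    value = parts[2] if len(parts) > 2 else ""
    return level, tag, value

def find_subordinate_sources(lines):
    """Yield (person_xref, level_1_tag, source_xref) for each level-2 SOUR."""
    person = None
    parent_tag = None
    for line in lines:
        if not line.strip():
            continue
        level, tag, value = parse_line(line)
        if level == 0:
            # Only INDI records are of interest in this sketch.
            person = value if tag == "INDI" else None
            parent_tag = None
        elif level == 1:
            parent_tag = tag
        elif level == 2 and tag == "SOUR" and person and parent_tag in SOURCED_TAGS:
            yield person, parent_tag, value

gedcom = """0 @I5@ INDI
1 NAME Samuel F. /Flood/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Flood obituary"""

results = list(find_subordinate_sources(gedcom.splitlines()))
print(results)  # [('@I5@', 'NAME', '@S2@')]
```

The only state that has to be remembered is the most recent level-0 record and the most recent level-1 tag, which is why reading the file top to bottom is enough.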

So I had to take a look at my database structure for conclusions and assertions and their linked citations and sources. Previously I hadn't known how I'd deal with this, but since I now have more experience, I was able to work out a plan and adjust the database to reflect the new plan.

Treebard is unique in that it links conclusions to assertions and assertions to citations. Citations are subordinate to sources in the database, just as PAGE is subordinate to SOUR in GEDCOM. The main point is that conclusions (events and attributes as perceived by the software user) are not linked directly to sources and citations. In Treebard, sources make an assertion. They assert or claim something about an event or attribute. But no matter how many sources you have for an event, the source never makes decisions about what happened. It just asserts some things and the user decides what actually happened and records this as a conclusion or finding.

Like most of computer genealogy, GEDCOM ignores assertions and treats everything as a conclusion, linking sources directly to conclusions as if either the user or the assertion doesn't exist. Treebard didn't invent assertions; it's just that computer genealogy generally pretends there's no such thing.

So in Treebard we have a claim table for recording assertions (what the source says) and a finding table for recording conclusions (what the user thinks happened). The two tables have nearly the same columns, such as event type, date, place, and particulars. The difference is that the finding table record is linked to zero-to-many assertions upon which the user can optionally base his findings, while the assertions are required to be linked to citations on a one-to-one basis. Exactly one citation for each assertion. This is meant to portray the world more accurately and usefully than the mud we normally see as sourcing in genieware data structures.

Since the GEDCOM links directly from citation to source to name or conclusion (event), how can Treebard accommodate the GEDCOM-encapsulated data? Treebard's main display, like most genieware, shows the conclusions. It's simpler to display than the many assertions which could back each conclusion up, and it's what the user wants to see. So, if I want the Treebard user to see in the GUI what has been imported, I have to input the data as conclusions, but if I don't want to break Treebard's flow of linkages, I have to fill in the links. Since there is a source, I have to create one assertion in the claim table for each citation, and I have to create one conclusion in the finding table for each assertion. The user can add any number of assertions later to support a single conclusion.
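To make the fill-in-the-links idea concrete, here is a minimal sketch using Python's sqlite3 module. The table names claim, finding, claims_findings and citation come from this post, but the column names, the function `import_sourced_event` and the sample values are assumptions for illustration only, not Treebard's real schema.

```python
import sqlite3

# Assumed, simplified schema; only the table names are taken from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        source_id INTEGER,
        citations TEXT);
    CREATE TABLE claim (
        claim_id INTEGER PRIMARY KEY,
        citation_id INTEGER NOT NULL,   -- exactly one citation per assertion
        event_type TEXT,
        date TEXT);
    CREATE TABLE finding (
        finding_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        event_type TEXT,
        date TEXT);
    CREATE TABLE claims_findings (
        claim_id INTEGER,
        finding_id INTEGER);
""")

def import_sourced_event(conn, person_id, source_id, page_text, event_type, date):
    """For one GEDCOM citation, create one claim, one finding, and their link."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO citation (source_id, citations) VALUES (?, ?)",
        (source_id, page_text))
    citation_id = cur.lastrowid
    # The assertion: what the source says.
    cur.execute(
        "INSERT INTO claim (citation_id, event_type, date) VALUES (?, ?, ?)",
        (citation_id, event_type, date))
    claim_id = cur.lastrowid
    # The conclusion: what shows up in the GUI after import.
    cur.execute(
        "INSERT INTO finding (person_id, event_type, date) VALUES (?, ?, ?)",
        (person_id, event_type, date))
    finding_id = cur.lastrowid
    cur.execute(
        "INSERT INTO claims_findings (claim_id, finding_id) VALUES (?, ?)",
        (claim_id, finding_id))
    conn.commit()
    return claim_id, finding_id

# Placeholder values, not real data from any GEDCOM file.
claim_id, finding_id = import_sourced_event(
    conn, person_id=5, source_id=2,
    page_text="(text of the PAGE line)", event_type="burial", date="(date)")
link_count = conn.execute("SELECT COUNT(*) FROM claims_findings").fetchone()[0]
```

Because the finding and the claim are separate rows, the user can later edit or delete the imported conclusion without touching the assertion it came from.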

Here's the actual chain of links as Treebard sees it:

Repositories and sources have a many-to-many relationship:
--FamilySearch.org and archive.org etc. all have the 1850 census.
--The 1850 census and the 1860 census etc. are all available at archive.org.
The database table `sources_repositories` has foreign key columns for both repository_id and source_id.

Sources and citations have a one-to-many relationship:
--The 1870 Roundtable Township census has citations for Uncle John's family as well as Aunt Edna's family.
--Those citations and others are all in a single census source.
The citation table has a foreign key column for source_id.

Claims and findings (aka assertions and conclusions) have a many-to-many relationship:
--Each conclusion can be supported by zero or more assertions.
--Each assertion can support any number of conclusions.
The database table `claims_findings` has foreign key columns for both claim_id and finding_id.
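The chain of links above could be sketched roughly like this with sqlite3. The junction table names `sources_repositories` and `claims_findings` are from this post; the other table names, all the column names apart from the foreign keys mentioned above, and the sample rows are assumptions. In particular I'm assuming source_id sits in the citation row and that each claim row carries its one citation_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repository (repository_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE source (source_id INTEGER PRIMARY KEY, title TEXT);
    -- many-to-many: repositories <-> sources
    CREATE TABLE sources_repositories (
        source_id INTEGER REFERENCES source (source_id),
        repository_id INTEGER REFERENCES repository (repository_id));
    -- one-to-many: source -> citations
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        source_id INTEGER REFERENCES source (source_id),
        citations TEXT);
    -- each claim has exactly one citation
    CREATE TABLE claim (
        claim_id INTEGER PRIMARY KEY,
        citation_id INTEGER REFERENCES citation (citation_id));
    CREATE TABLE finding (finding_id INTEGER PRIMARY KEY, event_type TEXT);
    -- many-to-many: claims <-> findings
    CREATE TABLE claims_findings (
        claim_id INTEGER REFERENCES claim (claim_id),
        finding_id INTEGER REFERENCES finding (finding_id));

    INSERT INTO repository VALUES (1, 'FamilySearch.org'), (2, 'archive.org');
    INSERT INTO source VALUES (1, '1850 census'), (2, '1860 census');
    INSERT INTO sources_repositories VALUES (1, 1), (1, 2), (2, 2);
    INSERT INTO citation VALUES (1, 1, 'placeholder citation text');
    INSERT INTO claim VALUES (1, 1);
    INSERT INTO finding VALUES (1, 'residence'), (2, 'occupation');
    INSERT INTO claims_findings VALUES (1, 1), (1, 2);
""")

# Walk the chain backwards: finding -> claim -> citation -> source -> repository.
rows = conn.execute("""
    SELECT repository.name
    FROM claims_findings
    JOIN claim ON claim.claim_id = claims_findings.claim_id
    JOIN citation ON citation.citation_id = claim.citation_id
    JOIN sources_repositories
        ON sources_repositories.source_id = citation.source_id
    JOIN repository
        ON repository.repository_id = sources_repositories.repository_id
    WHERE claims_findings.finding_id = ?
    ORDER BY repository.name
""", (1,)).fetchall()
repos = [name for (name,) in rows]

# One assertion supporting more than one conclusion.
supported = conn.execute(
    "SELECT COUNT(*) FROM claims_findings WHERE claim_id = 1").fetchone()[0]
```

Notice that the finding never touches the source directly. It only gets there by way of a claim and that claim's citation.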

Translating GEDCOM to this structure, here's the GEDCOM again:

0 @I5@ INDI
1 NAME Samuel F. /Flood/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Flood obituary

The person_id 5 already exists and the name_id already exists for Samuel. The SOUR line is read right after the NAME line, so it's safe to just use the name_id most recently created. A Python instance variable self.name_id was given this value when the name was input to the database. The sources have already been put in, so now that we've got a foreign key (2) for the source and a value for self.name_id, we just add them to the same record in...

Do I have a place to put them? I have an assertions table called `claim`. The event "naming" or "christening" or "change of name" could go there, for example, but a name is a special element of genealogy with its own database table. It is not an event and it is beyond an attribute. Each name even has an event type, so it really doesn't fit in either event table as a claim or finding. But it can still have a source.

That must be why I created a table called `links_links` once upon a time. The double plural suggests a many-to-many table, and that's what it is. I haven't used this table yet, so I have to look at its schema to see what's in it. Sure enough, there's a column for a name_id foreign key and another column for a citation_id foreign key. So a record will be created, a single line in links_links with only those two values.

GEDCOM being a text file pretending to be a database, there's no way for it to check its own logic by doing something. It doesn't do anything. The programmer has to write a program to make it meaningful. In this case, we see that the 0>1>2>3 hierarchy does not represent reality as perceived by Treebard. Names and sources have a many-to-many relationship, so sources are not really subordinate to names. The subordinate structure implies that the source is somehow a part of the name in the way that a citation is part of a source and a name is part of an individual. The lines in their hierarchical arrangement suggest that sources are linked to names, but it's a specific citation within a source that makes an assertion about a person's name. The computer user has to make his own decision whether to agree with the assertion. But you can only do so much with lines of text tagged by more text. We should have no trouble inputting the data anyway, even though in order to input it to Treebard, we'll have to upgrade GEDCOM's logic to match the actual world we live in.

Normally Treebard never tells or suggests to the user what should be put into the events (conclusions) table which is the central feature of the front page of Treebard, and assertions never morph into conclusions. But when the user decides to import a GEDCOM file, I guess he's concluding that the conclusions in the GEDCOM deserve to be shown as conclusions. He can always change this, after importing the file, by deleting, replacing or editing a conclusion. This has no effect on the data stored as assertions.

Post by Uncle Buddy on May 14, 2022 21:34:45 GMT -8

The result of GEDCOM's pretending to be a database is once again more work on the part of the person writing the import code. Not complaining about the work, it's my hobby.

Writing code. Not complaining. Is my hobby.

Well maybe both.

0 @I6@ INDI
1 NAME Nora Naomi /Mills/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary

The citation (PAGE tag) is what needs to be linked to the source, as the GEDCOM shows. But it would be wrong to link the source to the name, which the GEDCOM also suggests. So are the lines numbered wrong?

Not exactly. There might not be a way to number them right since this cute-but-useless line numbering design trait just doesn't cut the mustard. This just points out the inability of the GEDCOM design to accommodate reality. The source is not subordinate to either the NAME or the PAGE, in real life. Just in GEDCOM. In a real database, names and citations should be linked in a many-to-many table. Many citations can refer to a single name, and many names can be referenced by a single citation. The notion of subordinate doesn't work here because elements linked in a many-to-many junction table are equals in a hierarchy (not that thinking in terms of a hierarchy is that necessary or useful?) so neither is subordinate.

The simple solution, if I cared to fix GEDCOM, would be to stop toying with the data and put it where it's needed, so the developer doesn't have to fuss with it so much. As it is, the name_id and source_id foreign keys have to be held until we can read the level-3 line and get the citation. The citation has to be put into the citation table in the database. Then we have a foreign key, citation_id. In the big catch-all junction table links_links, citation_id and name_id will live together in the same row. The source_id foreign key 2 will go in the row of the citation table where the citation text is stored.
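The hold-until-PAGE step might look something like the sketch below. The table names `citation` and `links_links` and the `self.name_id` variable come from this thread; the class name `NameCitationImporter`, the `name` table, and all column names other than the foreign keys mentioned above are assumptions. It also assumes the source with ID 2 was already imported and that every SOUR value is a pointer like @S2@.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE name (
        name_id INTEGER PRIMARY KEY, person_id INTEGER, names TEXT);
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY, source_id INTEGER, citations TEXT);
    CREATE TABLE links_links (
        links_links_id INTEGER PRIMARY KEY, name_id INTEGER, citation_id INTEGER);
""")

class NameCitationImporter:
    """Hypothetical reader that links a name to a citation, not to a source."""

    def __init__(self, conn):
        self.conn = conn
        self.person_id = None
        self.name_id = None    # set when a level-1 NAME is stored
        self.source_id = None  # held from the level-2 SOUR until PAGE is read

    def read_line(self, line):
        parts = line.strip().split(" ", 2)
        level = int(parts[0])
        second = parts[1]
        rest = parts[2] if len(parts) > 2 else ""
        cur = self.conn.cursor()
        if level == 0:
            self.name_id = self.source_id = None
            # '0 @I6@ INDI' -> person_id 6
            self.person_id = int(second.strip("@")[1:]) if rest == "INDI" else None
        elif level == 1:
            self.name_id = self.source_id = None
            if second == "NAME" and self.person_id is not None:
                cur.execute(
                    "INSERT INTO name (person_id, names) VALUES (?, ?)",
                    (self.person_id, rest))
                self.name_id = cur.lastrowid
        elif level == 2:
            self.source_id = None
            if second == "SOUR" and self.name_id is not None:
                # '@S2@' -> source_id 2; nothing is written yet.
                self.source_id = int(rest.strip("@")[1:])
        elif level == 3 and second == "PAGE" and self.source_id is not None:
            # The source_id goes in the citation row...
            cur.execute(
                "INSERT INTO citation (source_id, citations) VALUES (?, ?)",
                (self.source_id, rest))
            citation_id = cur.lastrowid
            # ...and the name is linked to the citation only.
            cur.execute(
                "INSERT INTO links_links (name_id, citation_id) VALUES (?, ?)",
                (self.name_id, citation_id))
            self.source_id = None

gedcom = """0 @I6@ INDI
1 NAME Nora Naomi /Mills/
2 SOUR @S2@
3 PAGE 10 September 1962, page 2, Samuel F. Todd obituary"""

importer = NameCitationImporter(conn)
for line in gedcom.splitlines():
    importer.read_line(line)
conn.commit()

links = conn.execute("SELECT name_id, citation_id FROM links_links").fetchall()
print(links)  # [(1, 1)]
```

This sketch quietly drops a SOUR that has no PAGE under it. A real importer would have to decide what to do in that case, for example by creating a citation with empty text.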

Since the source has no direct relationship to the name, this is what the GEDCOM should be like...

...ummmm...

Well sorry, but fixing GEDCOM is not worth our time. It just plain and simple doesn't represent how data works, and if it did, it would be used for everything, not just for genealogy. I guess the genealogy community has just failed to notice that the rest of the world is using real databases, SQL or NoSQL (which is a real nested structure), to store data.
