GEDCOM's source_citation cardinality problems
Post by Uncle Buddy on Dec 8, 2023 0:01:39 GMT -8
I'm making fast progress on most of the GEDCOM export program for exporting from UNIGEDS to GEDCOM. GEDCOM's SOURCE_CITATION construct is another story altogether, as it is not a good match for reality. Here are today's notes, which turned into one of my famed rants:
source-to-citation is one-to-many
citation-to-assertion is one-to-many
event-to-assertion is one-to-many
Based on these specific data relationships, UNIGEDS cannot link a source to anything without an assertion, because for every event there can be many assertions; for every citation there can be many assertions; and GEDCOM's sourcing structure does not account for the real-world relationships of the participating elements.
Ignoring these specific data relationships, GEDCOM's TEXT (assertion) tag is subordinate to a meaningless DATA tag, which sits beside PAGE (citation) under a SOUR (source) pointer, which is subordinate to EVEN/DEAT/BIRT (event), which is subordinate to INDI (person) or FAM (couple). GEDCOM's notion of a direct line of subordination is so different from UNIGEDS' modeling of the real-life situation that the real relationships are hard to express at all in GEDCOM and impossible to express completely. The first thing we notice is that every separate assertion links to a citation, so if assertions can't be exported from the UNIGEDS structure to the GEDCOM structure, a lot of extra citation references will have to be weeded out. They would not be extra if assertions could be expressed correctly in GEDCOM, but with the assertions sent off to the exceptions log, the leftover citation references would seem extraneous and make no sense. BUT WHY NOT JUST PUT THE DATA.TEXT construct in the GEDCOM anyway? While we're at it, is it possible to write one method which can equally put a GEDCOM "source_citation" subordinate to INDI.EVEN, INDI.NAME, INDI, FAM, OBJE, or NOTE?
This example shows a GEDCOM "source_citation" subordinate to an individual event (INDI.EVEN|DEAT|BIRT|etc). The source_citation can also be subordinate to INDI.NAME, INDI, FAM, OBJE, or NOTE.
0 @I1@ INDI
1 RESI shoe shop with apartment \\ person-to-event is one-to-many
2 SOUR @S6@ \\ event-to-source has no direct relationship; should be event-to-assertion which has a one-to-many relationship, because an event is really a conclusion about an event, so without a source mentioning the event, no conclusion would have ever been entertained, much less concluded
3 PAGE E.D. 10-10, Sheet 7A \\ source-to-citation is one-to-many
3 DATA \\ undefined hoop of nothingness to jump through
4 TEXT xyz abc \\ citation-to-assertion is one-to-many, should be PAGE.TEXT with no DATA tag
4 DATE 14 JUN 1833 \\ date the assertion was made has a one-to-one relationship with the assertion, should be subordinate to TEXT
3 QUAY \\ confidence in the assertion has a one-to-one relationship with the assertion, should be subordinate to TEXT
Corrected GEDCOMoid using custom tags:
0 @I1@ INDI
1 RESI shoe shop with apartment \\ person-to-event is one-to-many
2 _TEXT \\ event-to-assertion is one-to-many
3 _DATE 14 JUN 1833 \\ assertion-to-assertion_date is one-to-one
3 _QUAY \\ assertion-to-assertion_surety is one-to-one
But...
? SOUR @S6@ \\ source isn't subordinate to anything
3 PAGE E.D. 10-10, Sheet 7A \\ source-to-citation is one-to-many AND citation-to-assertion is one-to-many. HOW CAN THAT BE EXPRESSED?
Therefore we have no other option than to learn something new, something which no one including myself has yet realized, about the GEDCOM problem, and that is this: what DOES GEDCOM express about cardinality?
It seems like GEDCOM pretends everything is a one-to-many relationship (see chart). That often works out, since one-to-many is a common relationship between data pairs, and where the designers of GEDCOM got lucky, the subordinate structure wasn't too simple or too linear to express some one-to-one and many-to-many relationships as well. But GEDCOM has only one way to show data relationships: subordination, which is a linear chain with wobbly branches, like those hanging mobiles we made in 5th grade art class which were so hard to balance and whose strings got tangled up in a slight breeze. See how one thing is linked to many in every single case? This implies a lack of any structure devoted to one-to-one or many-to-many links, as if every link in the world is supposed to be one-to-many. We just "have to" use it anyway, because it's "all there is". Why has computer genealogy forgotten the wider field of computing, which uses SQL to express related data, instead of this wobblesome noise:
                INDI
           _______|________
           |    |    |    |
         RESI NAME EVEN DEAT
    _______|________   ___|_______
    |    |    |    |   |    |    |
       SOUR
Here's the problem with that: real data relationships--with a solid, definable, definite, objective description--exist between the members of a data pair, like "event-to-assertion" and "citation-to-assertion". Real relationships are not accurately described as a chain of nesting or subordination, but as a web of links going in every direction. This web of related facts adds up to what we perceive as reality or history. This web is a sturdy, flexible structure. It can grow or shrink, and it can lose a link without losing its meaning or value. But GEDCOM's linear structure depends on not losing a link. Because it misrepresents reality, it is useless for conveying the more intricate details of the world, and the more detailed the attempted description, the more complex its uselessness becomes. Correctly-defined relationships between two data types, on the other hand, remain simple and easily defined, and this is why reality can be described at all: as you look closer from the right perspective, things keep getting simpler. When the opposite happens, you need to clean your glasses, or else get a job writing rules for everyone else to try and follow.
The first main problem with GEDCOM's source_citation is that event-to-assertion and citation-to-assertion are both one-to-many relationships. With assertion being the many side of two superior elements, a linear system of subordination breaks down. You need SQL's omnidirectional web of simple, small, easily defined pairs of data types. There's no way to say, in GEDCOM, that assertion is subordinate to both event and citation, because subordinate (which implies a chain with wobbly branches) is the wrong description. The right word is "link", which implies the strongest structure in the world, for its weight: a net or web. Like the geodesic dome. The weightiness of GEDCOM, for its strength, has been well-noted.
So far we have only considered sourcing of events, but names also need to be sourced, as do event-parts like date, place, particulars, age, and roles. (GEDCOM also allows sourcing of OBJE (multimedia files), NOTE, INDI (person) or FAM (couple), all of which seem sort of abstract if not downright weird.) Because name-to-assertion is also a one-to-many relationship, there are clearly at least three superior tags all competing to be the parent tag of TEXT, in a system of linear subordination which won't allow a tag to have more than one parent. In UNIGEDS' assertion table, there are foreign key columns for name_id, event_id, and citation_id. Problem solved.
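Here is a minimal sketch of that idea, assuming SQLite through Python's sqlite3 module. Only the column names name_id, event_id, and citation_id come from the paragraph above. Every other table name, column name, and row of sample data (apart from the PAGE text used in the example) is made up for illustration and is not UNIGEDS' actual schema.

```python
import sqlite3

# Hypothetical schema: only name_id, event_id and citation_id are taken from
# the post; all other table and column names are assumptions.
SCHEMA = """
CREATE TABLE source (source_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE citation (
    citation_id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES source (source_id),
    page TEXT);
CREATE TABLE event (event_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE name (name_id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE assertion (
    assertion_id INTEGER PRIMARY KEY,
    assertion_text TEXT,
    citation_id INTEGER NOT NULL REFERENCES citation (citation_id),
    event_id INTEGER REFERENCES event (event_id),
    name_id INTEGER REFERENCES name (name_id));
"""


def build_demo():
    """Create an in-memory database holding a few invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO source VALUES (6, 'placeholder source title')")
    conn.execute("INSERT INTO citation VALUES (1, 6, 'E.D. 10-10, Sheet 7A')")
    conn.execute("INSERT INTO event VALUES (1, 'residence')")
    conn.execute("INSERT INTO name VALUES (1, 'placeholder name')")
    # One citation feeds three assertions: two about the event, one about the name.
    conn.executemany(
        "INSERT INTO assertion (assertion_text, citation_id, event_id, name_id) "
        "VALUES (?, ?, ?, ?)",
        [
            ("first claim about the event", 1, 1, None),
            ("second claim about the event", 1, 1, None),
            ("claim about the name", 1, None, 1),
        ],
    )
    return conn


def sources_for_event(conn, event_id):
    """Walk event -> assertion -> citation -> source by following links."""
    rows = conn.execute(
        """
        SELECT source.title, citation.page, assertion.assertion_text
        FROM assertion
        JOIN citation ON citation.citation_id = assertion.citation_id
        JOIN source ON source.source_id = citation.source_id
        WHERE assertion.event_id = ?
        ORDER BY assertion.assertion_id
        """,
        (event_id,),
    )
    return rows.fetchall()
```

In this sketch the source never touches the event directly. It is reached only through citation, and the assertion row is the one place where event, name, and citation meet.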
The second main problem with GEDCOM's source_citation is that source, in the real world, is linked to nothing else in the source_citation tag chain except citation. That's why my attempt above to redo the source citation with custom tags failed (unless I'm just an idiot): the linearity of the model didn't give source a place to be. This is not true of SQL, where everything has exactly one place to be, because everything that needs an ID has an ID, and so can be referenced anywhere.
Now that we know the flavor of the stew we're chopping our data into, we still want to know: can the odd GEDCOM-specified constructs be used anyway, by biting the bullet and dumbing down the data structure? If not, then the whole sourcing feature needs to be either 1) put into an exceptions report, or 2) expressed with custom tags, which we've shown above won't work with a linear chain of subordinate data. So the only options are to dumb down the data structure and use DATA.TEXT for assertions, or to put the entire source_citation into an exceptions log.
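For the dumbed-down route, a rough sketch of one method that can hang a source_citation under any parent might look like this. The function name and parameters are hypothetical, not part of the actual UNIGEDS export code. It assumes the caller passes the level at which SOUR should appear (2 under INDI.EVEN or INDI.NAME, 1 under INDI or FAM, and so on). It also assumes, as I read GEDCOM 5.5.1, that TEXT may repeat under a single DATA tag, which is worth checking against the spec.

```python
def source_citation_lines(level, source_xref, page, assertions):
    """Return GEDCOM lines for one source_citation starting at `level`.

    level        -- level number for the SOUR line (parent's level + 1)
    source_xref  -- the source record's ID without the @ signs, e.g. "S6"
    page         -- citation text for the PAGE line
    assertions   -- list of assertion strings; each becomes a DATA.TEXT line

    Assumption: assertion strings contain no line breaks, so no CONT/CONC
    handling is attempted here.
    """
    lines = [
        f"{level} SOUR @{source_xref}@",
        f"{level + 1} PAGE {page}",
    ]
    if assertions:
        # DATA carries no value of its own; it only wraps the TEXT lines.
        lines.append(f"{level + 1} DATA")
        for text in assertions:
            lines.append(f"{level + 2} TEXT {text}")
    return lines


if __name__ == "__main__":
    # Under an individual event (INDI.RESI sits at level 1, so SOUR is level 2).
    for line in source_citation_lines(2, "S6", "E.D. 10-10, Sheet 7A", ["xyz abc"]):
        print(line)
```

Because the only thing that changes from parent to parent is the starting level, the same function would serve INDI.NAME, INDI, FAM, OBJE, or NOTE. What it cannot do is say which event or name each assertion belongs to beyond the one parent it happens to be nested under, which is the loss being accepted here.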
Not that we want to play the blame game, but can anyone see why, with GEDCOM as a supposed data standard, assertions have been barely touched upon in most genieware? If we want customers, we need import/export, but if we want better genieware, we need better data structure for import/export.