Relationships of genealogy data in a SQL database
Jul 17, 2024 23:44:41 GMT -8
Post by Uncle Buddy on Jul 17, 2024 23:44:41 GMT -8
REPRESENTING RELATIONSHIPS AMONG ELEMENTS OF GENEALOGY IN A SQL DATABASE
The elements of genealogy are those things which get their own database table, for the most part, and are endowed with unique ID numbers. The primary elements of genealogy are those things that pre-exist genealogy, like persons, places, and sources. Some elements of genealogy, such as couples and other relationships, are compound elements because they are comprised of two or more elements.
A database table exists for each of the elements so that each unique member
of each element category will have its own primary key or unique ID number
within that category. To link two elements together, they are in the same row of a database table. Two data items are in the same row literally, in the case of a one-to-one relationship such as person-to-gender. "Mary" and "female" go in the same row of the person table.
But in the case of person-to-name, that's a one-to-many relationship, so person and name each need their own table in the database. A foreign key from the person table will be used in a person_id column of the name table. But how do you know that it shouldn't be the other way around? Why can't you put the name_id in the person_table? The key to it is that one person can have many names, but each name refers to only one person. So the primary key from the person table is used as a foreign key in the name table as many times as necessary. It won't work the other way around.
The definition of a relationship between two data types is its "cardinality". The cardinality types are one-to-one, one-to-many, and many-to-many.
Assertions are basic genealogy elements which exist separately from the genealogist's conclusions. The genealogist's conclusions are OF genealogy, they are secondary since they don't pre-exist genealogy. They can exist without evidence but can optionally be influenced by assertions. Assertions--what a source says--exist because a citation--a place within a source--exists. Citations can't exist without sources. Since we're talking about what pre-exists what, here is the progression, according to the real world:
--source must exist in order for
--citation to exist, and citation must exist in order for
--assertion to exist
A source can have many citations, but each citation refers to only one source. A citation can yield many assertions, but each assertion is related to only one citation. So source-to-citation and citation-to-assertion are both one-to-many relationships. To represent a one-to-many relationship, a SQL database has a table for each element, and the foreign key for the one side of the relationship is used in a column in the many side of the relationship. So the citation table has a column for foreign keys from the source table. The assertion table has a column for foreign keys from the citation table. This is the right way to represent the data relationships in a database table schema, in order to mirror the structure of the real world.
Since the assertion is what the source says at the citation, we have to ask what is the topic of what's being said by the assertion. An assertion is a claim, a statement of fact or supposed fact, about something. What is this something? Assertions often make statements about the elements event, attribute, and name. We can lump events and attributes together in the database table called "event". In the user interface we can separate them but that's another discussion. As data they can be treated exactly the same, so they should be.
Once again we have to determine cardinality before we know where to put the foreign key. Does event_id go in the assertions table, or the other way around?
An event has several parts, including date, place, age, particulars, and role, about which an assertion can be made. Since a conclusion is about one of these items, an assertion has to be about one of these items. We treat these assertion types the same in some ways, because they're just text from the source--literal or paraphrased--but in other ways they need to be stored separately, because the user will want to be reminded of how he came to a conclusion about a place OR a date, not both. A source might be wrong about the place but right about the date, so assertions are evaluated separately and can't be lumped together.
To answer the question about cardinality, an event can be linked to any number of date assertions, any number of place assertions, etc., but each assertion refers to only one event. So the cardinality of event-to-assertion is one-to-many. Therefore the event_id has to be used as a foreign key in the assertion table. While a name is a sort of special attribute of a person, it's not like other attributes. Person names are their own element in genealogy, so they get their own table. Each name gets its own primary key. A person who has a birth name and two aliases will have one person ID and three name IDs. The person ID is repeated in three rows of the name table. Primary keys can exist only once in their own table. So getting this backwards doesn't work.
The cardinality of name-to-assertion is the same as event-to-assertion, and this will be the same for other elements that assertions can make statements about. So the assertion table has a column for name_id and a column for event_id, and either of these foreign key columns--but not both--requires a reference for each assertion.
I don't know if you can tell just by reading this, but once you've tried to do it in a real database, you find that there is no way to accurately mirror the real world in a database about genealogy sources without an assertion table. The assertion text itself can be left blank when there's no confusion for vying assertions to try and clear up, and who wants to say a bunch of obvious things anyway? But the assertion table and its columns and rows must exist in order to correctly link events and names to sources and citations. Anything else is pretending based on assumptions, which are just glossed over information, cherry-picked for convenience. That is the state of the genieware industry, and that's why Treebard and UNIGEDS exist. As a non-commercial entity, the Treebard project has taken the time to get UNIGEDS right.
For an example of many-to-many relationships, we can look at notes. When information doesn't fit in another category, a note can take up the slack. In other genieare, I used to have to copy and paste notes from element to element, so in UNIGEDS I created a proper schema for notes so any note could be linked to any number of elements and any element could be linked to any number of notes. That's an example of a many-to-many relationship. Most columns in many-to-many tables (or "junction tables") are foreign key columns. In UNIGEDS there's a notes_links table with an obligatory column for note_id's foreign key and several other columns such as source_id and name_id so that any element can be linked to any note by putting that element in the same row of notes_links as the foreign key for the note_id.
The elements of genealogy are those things which get their own database table, for the most part, and are endowed with unique ID numbers. The primary elements of genealogy are those things that pre-exist genealogy, like persons, places, and sources. Some elements of genealogy, such as couples and other relationships, are compound elements because they are comprised of two or more elements.
A database table exists for each of the elements so that each unique member
of each element category will have its own primary key or unique ID number
within that category. To link two elements together, they are in the same row of a database table. Two data items are in the same row literally, in the case of a one-to-one relationship such as person-to-gender. "Mary" and "female" go in the same row of the person table.
But in the case of person-to-name, that's a one-to-many relationship, so person and name each need their own table in the database. A foreign key from the person table will be used in a person_id column of the name table. But how do you know that it shouldn't be the other way around? Why can't you put the name_id in the person_table? The key to it is that one person can have many names, but each name refers to only one person. So the primary key from the person table is used as a foreign key in the name table as many times as necessary. It won't work the other way around.
The definition of a relationship between two data types is its "cardinality". The cardinality types are one-to-one, one-to-many, and many-to-many.
Assertions are basic genealogy elements which exist separately from the genealogist's conclusions. The genealogist's conclusions are OF genealogy, they are secondary since they don't pre-exist genealogy. They can exist without evidence but can optionally be influenced by assertions. Assertions--what a source says--exist because a citation--a place within a source--exists. Citations can't exist without sources. Since we're talking about what pre-exists what, here is the progression, according to the real world:
--source must exist in order for
--citation to exist, and citation must exist in order for
--assertion to exist
A source can have many citations, but each citation refers to only one source. A citation can yield many assertions, but each assertion is related to only one citation. So source-to-citation and citation-to-assertion are both one-to-many relationships. To represent a one-to-many relationship, a SQL database has a table for each element, and the foreign key for the one side of the relationship is used in a column in the many side of the relationship. So the citation table has a column for foreign keys from the source table. The assertion table has a column for foreign keys from the citation table. This is the right way to represent the data relationships in a database table schema, in order to mirror the structure of the real world.
Since the assertion is what the source says at the citation, we have to ask what is the topic of what's being said by the assertion. An assertion is a claim, a statement of fact or supposed fact, about something. What is this something? Assertions often make statements about the elements event, attribute, and name. We can lump events and attributes together in the database table called "event". In the user interface we can separate them but that's another discussion. As data they can be treated exactly the same, so they should be.
Once again we have to determine cardinality before we know where to put the foreign key. Does event_id go in the assertions table, or the other way around?
An event has several parts, including date, place, age, particulars, and role, about which an assertion can be made. Since a conclusion is about one of these items, an assertion has to be about one of these items. We treat these assertion types the same in some ways, because they're just text from the source--literal or paraphrased--but in other ways they need to be stored separately, because the user will want to be reminded of how he came to a conclusion about a place OR a date, not both. A source might be wrong about the place but right about the date, so assertions are evaluated separately and can't be lumped together.
To answer the question about cardinality, an event can be linked to any number of date assertions, any number of place assertions, etc., but each assertion refers to only one event. So the cardinality of event-to-assertion is one-to-many. Therefore the event_id has to be used as a foreign key in the assertion table. While a name is a sort of special attribute of a person, it's not like other attributes. Person names are their own element in genealogy, so they get their own table. Each name gets its own primary key. A person who has a birth name and two aliases will have one person ID and three name IDs. The person ID is repeated in three rows of the name table. Primary keys can exist only once in their own table. So getting this backwards doesn't work.
The cardinality of name-to-assertion is the same as event-to-assertion, and this will be the same for other elements that assertions can make statements about. So the assertion table has a column for name_id and a column for event_id, and either of these foreign key columns--but not both--requires a reference for each assertion.
I don't know if you can tell just by reading this, but once you've tried to do it in a real database, you find that there is no way to accurately mirror the real world in a database about genealogy sources without an assertion table. The assertion text itself can be left blank when there's no confusion for vying assertions to try and clear up, and who wants to say a bunch of obvious things anyway? But the assertion table and its columns and rows must exist in order to correctly link events and names to sources and citations. Anything else is pretending based on assumptions, which are just glossed over information, cherry-picked for convenience. That is the state of the genieware industry, and that's why Treebard and UNIGEDS exist. As a non-commercial entity, the Treebard project has taken the time to get UNIGEDS right.
For an example of many-to-many relationships, we can look at notes. When information doesn't fit in another category, a note can take up the slack. In other genieare, I used to have to copy and paste notes from element to element, so in UNIGEDS I created a proper schema for notes so any note could be linked to any number of elements and any element could be linked to any number of notes. That's an example of a many-to-many relationship. Most columns in many-to-many tables (or "junction tables") are foreign key columns. In UNIGEDS there's a notes_links table with an obligatory column for note_id's foreign key and several other columns such as source_id and name_id so that any element can be linked to any note by putting that element in the same row of notes_links as the foreign key for the note_id.