Structuring Sources, Citations, etc. in the Database
Post by Uncle Buddy on Jul 7, 2023 6:10:53 GMT -8
There's plenty of information available on how to do, and overdo, citations. Of course the big websites will give you an exact way to cite their site, which amounts to an advertisement for the site and scares away anyone who might have been considering actually reading through your data, but that is neither here nor there. This post is not about how to do citations, except to say that it's up to you. As for what is crucial and important, well, it's your hobby, so suit yourself. Genieware should not break citations up into parts such as author, title, page, etc., and it should neither suggest how to format your data nor force any particular format on you. A citation is just plain text. But one thing that GEDCOM and most or all genieware don't recognize is that each citation should be stored only once, under its own unique primary key. As a result, most genieware is clogged with repetition, and most of the tedium in sourcing comes from having to copy-paste identical text over and over. The repetition also slows down data transfer with GEDCOM.
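To make "stored only once" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `citation`, the column names, and the helper `get_or_add_citation` are all made up for illustration; this is not Treebard's or UNIGEDS's actual schema, just one way the idea could look.

```python
import sqlite3

def get_or_add_citation(conn, text):
    """Return the primary key for this citation text, storing the text only once."""
    # The UNIQUE constraint makes a second insert of the same text a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO citation (citation_text) VALUES (?)", (text,)
    )
    row = conn.execute(
        "SELECT citation_id FROM citation WHERE citation_text = ?", (text,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
# Hypothetical table: the citation is plain text in whatever format the user likes.
conn.execute(
    """CREATE TABLE citation (
           citation_id INTEGER PRIMARY KEY,
           citation_text TEXT NOT NULL UNIQUE
       )"""
)

# Entering the same citation twice yields the same key and a single stored row.
first = get_or_add_citation(conn, "Marriage book, volume 2, page 14")
second = get_or_add_citation(conn, "Marriage book, volume 2, page 14")
count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
```

Anything that needs the citation (an assertion, for instance) would then store only the integer key instead of another copy of the text.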
I'm here today to discuss what should be taking place behind the scenes, in the data storage structure, when the genealogist enters the citation, in any format he chooses, so that he or viewers of his tree can find his sources and check his conclusions.
I have not written and rewritten this portion of Treebard and UNIGEDS multiple times, like I have with other features. I saved it for later, worked on other things first, and then when I was finally ready to tackle it, I turned on the webcam and let you watch. I posted about 204 videos of myself designing and writing the assertions/sources/citations/repositories feature's first draft.
As I approach the conclusion of Treebard Development Chapter 2, I don't find the first draft of Assertions & Co. to be good enough. In particular, the notion of "locator" was fuzzy at the time and therefore might have found an odd place to live in the database. I won't go back over this old ground, as I prefer to move forward toward, hopefully, a clearer picture of how these intertwining features need to work. Not that I'm an expert yet. This is only the second draft, and I'm still thinking; I haven't started it yet.
It occurred to me that there might be two or more kinds of locators: a locator for a source and a locator for a citation. For example, at archive.org there would be one URL that goes to the general page for the 1880 US Census, one URL that goes to the general page for a reel of microfilm that contains County X in 1880 as well as other counties, and another URL for the actual web page showing the census page relevant to the assertion (what the source says).
So I thought about having a locator field that could be linked to either a source or a citation. But then it occurred to me that maybe what I was calling a "locator" is really just a part of a citation. A citation has parts: a publisher, an author, a date, a page or chapter or whatever. Isn't a URL or an ISBN just another part of a citation? Well, maybe, but let's face it: these citations can get too long and too detailed. If there's one citation part that deserves to be broken out into its own category, maybe this is it.
I had been thinking of locators as only a call number in a library or only a URL on a website, with the locator thus being linked to a repository and a citation, not a source. I even thought of nesting sources inside repositories and nesting citations inside sources. But this could get rigid fast. I want to avoid structures that are cute-but-useless.
Finally I had what might be an original thought. I'm not sure, since I tend to think too much rather than research too much, so this could have come up before somewhere else.
I was thinking about having a citation for a source as well as a citation for the assertion (the assertion being the details of what the source says that are relevant to the family tree). Then it occurred to me that locators and citations are not the same thing. For this reason I'm now thinking that there should be separate categories for locators and citations, as I designed it back when I was making that series of 204 videos.
A citation helps you search for and maybe find a source by giving you information about the source itself. Everything from publisher to chapter, line, and page is about the source and a location within that source. But a locator is something completely different. A locator takes you directly to that source, and gives you no other information about the source. If you hand a librarian a scrap of paper with a call number on it, the librarian can run and get that book out of the stacks for you, and you'll know whether it's the book you wanted, but the librarian doesn't need to know the name of the book as long as the call number is right and complete. A URL works the same way. It takes you straight to the referenced source, generally without telling you what the source is or anything about it. So a citation is about the source, but a locator is a magic button or label coded with a map to the source.
So a locator could be kept in the database separately from the citation, thereby making the citation shorter, which is good for our GUI because it makes things more readable. In support of this argument, there's the cardinality of the situation as it relates to database structure. For any given citation, if there are two copies, two versions, two different resolutions for a scanned document, etc., there are probably two different locators, even if the two are at the same repository. Library coding systems such as the Dewey Decimal System have standards that all repositories should follow, but they aren't forced to follow them exactly. So there needs to be a separate field for locators, so that the citation for two versions of the same source can stay word-for-word identical and be stored once, in case we need to consult both versions. This was necessary in the following case, for example.
A man had a short marriage that was not documented very well. Ancestry.com has an illegible scan of the marriage license as well as someone's ridiculous transcription of the bride's name, ridiculous because it was a wild guess. Familysearch.org has the same document, clear as day and perfectly easy to read. The two transcriptions are not even close. The citations are the same, because they are the same document: same book, same page, same volume. Only the locator is different. Database-wise, we do not want to store two nearly identical citations for the same document, different only because the locator is wrongly pasted onto the citation. We want to store one citation one time and have two different locators (the URLs at the two different websites) each use a foreign key reference to that citation. When I create this man's tree I will record both versions of the document so I can warn folks away from the bad transcription at ancestry.com. People have a tendency to assume that the transcription and the source are the same thing. Then a dozen people copy the bad information from the mistaken tree and suddenly it's accepted as fact because everyone's saying the same thing.
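Here is a rough sketch of that one-citation, two-locators arrangement, again in Python with the built-in sqlite3 module. The table and column names, the citation text, and the example.com URLs are placeholders I made up for illustration; they are not the real schema or the real links.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(
    """
    CREATE TABLE citation (
        citation_id INTEGER PRIMARY KEY,
        citation_text TEXT NOT NULL UNIQUE
    );
    -- Many locators can point at one citation; the citation text is never repeated.
    CREATE TABLE locator (
        locator_id INTEGER PRIMARY KEY,
        citation_id INTEGER NOT NULL REFERENCES citation (citation_id),
        locator_text TEXT NOT NULL,
        note TEXT
    );
    """
)

# One citation, stored one time (placeholder wording).
cur = conn.execute(
    "INSERT INTO citation (citation_text) VALUES (?)",
    ("County marriage book, same volume, same page",),
)
citation_id = cur.lastrowid

# Two locators for the same document at two different websites (placeholder URLs).
conn.executemany(
    "INSERT INTO locator (citation_id, locator_text, note) VALUES (?, ?, ?)",
    [
        (citation_id, "https://example.com/site-a/scan", "illegible scan, bad transcription"),
        (citation_id, "https://example.com/site-b/scan", "clear scan, easy to read"),
    ],
)

# Each row pairs the single citation with one of its locators.
rows = conn.execute(
    """SELECT c.citation_text, l.locator_text, l.note
       FROM citation AS c
       JOIN locator AS l ON l.citation_id = c.citation_id
       ORDER BY l.locator_id"""
).fetchall()
```

The note column is just one possible place to warn viewers about the bad transcription; whether that belongs on the locator or somewhere else is still an open design question.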
So that's where it stands right now. Tomorrow I plan to start the rewrite of the assertions/sources/citations/repositories feature of Treebard and UNIGEDS. Not with the camera running this time.