How to read GEDCOM faster
Post by Uncle Buddy on Feb 27, 2024 15:47:08 GMT -8
This information is from Tamura Jones, and it suggests that my GEDCOM import program could be improved in the following way.
In my gedcom_import.py, the data is read in steps, and a little data is handled in each step. Tamura agrees with that approach, but disagrees with saving the file after each step. I haven't gotten around to trying very large GEDCOM files, and I have a lot of RAM, but what about old computers, small computers, and big files? Saving the file in a newly translated form several times throughout the import process would cost disk space plus the time and resources it takes to write to disk each time. The correct way to import GEDCOM in stages is to keep the data in memory while manipulating it; Tamura said that's how the programs that import GEDCOM quickly do it. A rough sketch of what that could look like is below.
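Here's a minimal sketch of that keep-it-in-memory idea (my own illustration, not the actual gedcom_import.py; the file name and stage names are made up). Each stage hands its results straight to the next stage as Python generators, so nothing gets written back to disk until the import is finished.

```python
# Sketch only: staged GEDCOM import that keeps everything in memory.
# "family.ged" and the stage names are placeholders, not gedcom_import.py.

def read_lines(path):
    # Stage 1: stream the raw lines of the GEDCOM file.
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            yield raw.rstrip("\r\n")

def parse_lines(lines):
    # Stage 2: split "LEVEL [@XREF@] TAG [VALUE]" into fields
    # (simplified; real GEDCOM parsing has more cases than this).
    for line in lines:
        if not line:
            continue
        level, _, rest = line.partition(" ")
        xref = ""
        if rest.startswith("@"):
            xref, _, rest = rest.partition(" ")
        tag, _, value = rest.partition(" ")
        yield int(level), xref, tag, value

def group_records(fields):
    # Stage 3: gather the lines belonging to each level-0 record.
    record = []
    for field in fields:
        if field[0] == 0 and record:
            yield record
            record = []
        record.append(field)
    if record:
        yield record

# Each stage feeds the next one in memory; no intermediate file is saved.
records = list(group_records(parse_lines(read_lines("family.ged"))))
print(len(records), "level-0 records imported")
```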
He didn't know how Python might go about this. My preliminary research indicates that the Pandas library might be one solution: Pandas can be imported into Python and used to read a large file in chunks. This video is one of many demonstrations; it shows how to get the data broken up into chunks and then concatenated back together. I haven't tried it yet. (I'm off the GEDCOM warpath right now, working on editing all my YouTube videos.) A sketch of the chunking pattern is below.
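For the record, this is roughly the pandas pattern the video demonstrates: read the file with a chunksize, work on each chunk, then concat the chunks back together. I haven't run this against a real GEDCOM yet, so treat the file name and the tab-as-separator assumption as guesses.

```python
# Untested sketch of chunked reading with pandas. Assumes a UTF-8 GEDCOM
# file whose lines never contain a tab, so each physical line lands in one
# "line" column. The file name is a placeholder.
import csv
import pandas as pd

CHUNK_SIZE = 100_000  # lines per chunk; tune for the machine's RAM

def load_gedcom(path):
    reader = pd.read_csv(
        path,
        header=None,
        names=["line"],
        dtype=str,
        sep="\t",                # assumed never to appear inside a GEDCOM line
        quoting=csv.QUOTE_NONE,  # GEDCOM is not CSV; leave quotes alone
        skip_blank_lines=False,
        encoding="utf-8-sig",    # tolerate a UTF-8 byte-order mark
        chunksize=CHUNK_SIZE,    # hand back the file one chunk at a time
    )
    chunks = []
    for chunk in reader:
        # do per-chunk work here (split tags, filter records, etc.)
        chunks.append(chunk)
    # stitch the processed chunks back into one in-memory DataFrame
    return pd.concat(chunks, ignore_index=True)

frame = load_gedcom("family.ged")
print(len(frame), "GEDCOM lines loaded")
```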
Tamura also suggested a way to generate very large GEDCOM files for testing purposes, using his program GedFan, which is described in these articles:
The GEDCOM Fan Creator
GedFan 0.4.0.0
"The series of increasingly larger files that GedFan generates allow establishing a genealogy application's fan value." The fan value is a rating of how big a file your import program can handle before it becomes inadequate to the task.