First question: the RG2 material, by its nature, always references the material appearing in RG1. The RG2 texts must contain at least some of the animals from RG1. How do I know? This is niche government data: it is highly structured, it has followed this format for decades (i.e. since before digitization), and the government creates the second data set specifically to reference the first. (We’re not actually dealing with animals in our data sets; it’s just a convenient way to simplify this.)
As for your second question, whether two animals might appear one after the other: this is a possibility. However, consecutive animals are always separated by commas, whereas an animal appearing on its own is never followed by one. We could exclude these animal lists by skipping any animal followed by a comma. There are also additional things we can do.
For now, I’d be happy just to get the texts that sit between these animal appearances. Even if there are complications, such as multiple animals, that this first step doesn’t address, I’m confident we could get there with subsequent steps. It’s this piece, getting the in-between data to show in the right cells, that’s giving me grief.
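To make the goal concrete, here is a minimal sketch in Python of the "text in between animals" step, outside of any RG setup. The function name and the sample strings are made up for illustration, and it assumes the animal names are known up front (e.g. taken from RG1). It applies only the rule described above: an animal followed by a comma is skipped. The last animal in a comma-separated list has no comma after it, so it would still be picked up; that is one of the complications left for a later step.

```python
import re

def segments_between_animals(text, animals):
    """Return (animal, following_text) pairs for each animal not followed by a comma."""
    if not animals:
        return []
    # Longest names first so a longer name wins over a shorter one it contains.
    names = sorted(animals, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in names)
    # Whole-word match; the negative lookahead skips "animal," (list members).
    pattern = re.compile(rf"\b({alternation})\b(?!\s*,)", re.IGNORECASE)
    matches = list(pattern.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # The segment runs from the end of this animal to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pairs.append((m.group(1), text[m.end():end].strip()))
    return pairs

pairs = segments_between_animals(
    "cat First notice text. dog Second notice text.", ["cat", "dog"]
)
for animal, segment in pairs:
    print(animal, "->", segment)
```

Each pair could then be written to the cell that belongs to that animal.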
Finally, if none of this persuades you that the issue is not so much with how the data is initially captured, then let’s explore that. Currently the data is being imported manually via API. We could do it via CSV, but this is far from ideal because of the manual work involved: downloading the data into CSV, cleaning it, and so on. However, along the lines of your suggestion to attach ‘animal’ to the text data, perhaps there is an efficient way to attach the correct animals to the texts within the CSV before uploading it.
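If the CSV route were taken, the tagging could be done in one pass before upload. The following is only a sketch under assumptions: the column names `text` and `animals`, the `;` separator, and the function name are all hypothetical, and the list of animal names is assumed to come from the first data set.

```python
import csv
import io
import re

def tag_rows_with_animals(csv_text, animals, text_column="text"):
    """Return CSV text with an extra 'animals' column listing names found in each row."""
    names = sorted(animals, key=len, reverse=True)
    pattern = None
    if names:
        alternation = "|".join(re.escape(n) for n in names)
        pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(reader.fieldnames) + ["animals"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        found = []
        if pattern is not None:
            for m in pattern.finditer(row[text_column]):
                name = m.group(1).lower()
                if name not in found:  # keep first-seen order, no duplicates
                    found.append(name)
        row["animals"] = ";".join(found)
        writer.writerow(row)
    return out.getvalue()

sample = 'id,text\n1,The cat notice\n2,"dog, cat listed"\n'
print(tag_rows_with_animals(sample, ["cat", "dog"]))
```

This still leaves the manual download and cleaning steps in place, which is the main drawback of the CSV route.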
Again, I’d far rather do it via API than by parsing through RGs (or a text box, and so on).