Create merge() to merge datasets #75

Open
4 tasks
jimcasaer opened this issue Mar 17, 2023 · 12 comments · May be fixed by #112
@jimcasaer

jimcasaer commented Mar 17, 2023

Note by @peterdesmet: see #75 (comment) for recent thoughts.


We often want to merge the exports of several research projects (increasingly so for multi-site research), all using Camtrap DP as the export format and mostly exported from Agouti.
This can be done manually by calling read_camtrap_dp() several times and then merging the resulting CSV files, but I wonder whether it would be better to add a function to camtraptor that automatically:

  • merges different data packages resulting from read_camtrap_dp
  • adds a column to the deployments.csv using the project_name as variable
  • adds a column to the deployments.csv using the project location / study site
  • the function name could be "join_exports" or something similar but this I leave up to Peter :-)
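
For illustration, the manual approach described above could look roughly like the sketch below. It uses base R only and toy data frames in place of real deployments tables; the `projectName` column and the `tag_project()` helper are hypothetical names, not part of camtraptor.

```r
# Toy stand-ins for the deployments tables of two packages (hypothetical data).
deployments_a <- data.frame(
  deploymentID = c("a1", "a2"),
  locationName = c("forest", "meadow"),
  stringsAsFactors = FALSE
)
deployments_b <- data.frame(
  deploymentID = "b1",
  locationName = "dune",
  stringsAsFactors = FALSE
)

# Hypothetical helper: tag each row with the project it came from.
tag_project <- function(deployments, project_name) {
  deployments$projectName <- project_name
  deployments
}

# Stack the tagged tables into one deployments table.
merged_deployments <- rbind(
  tag_project(deployments_a, "Project A"),
  tag_project(deployments_b, "Project B")
)
```

A study site column could be added the same way. Doing this by hand for deployments, media and observations is what a dedicated function would automate.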
@peterdesmet
Member

I agree with the need for this functionality and that it is best supported by a function in Camtrap DP. The implementation details are still to be discussed, but I'd prefer to tackle this after https://github.com/inbo/camtraptor/milestone/3

@damianooldoni
Member

A logical request, definitely. I agree with @peterdesmet: it's better to add this functionality after the big refactoring.

@jimcasaer
Author

Just to get an idea (and to plan the next analyses): what would be the timing for this new feature?

@damianooldoni
Member

Agreed with @peterdesmet to add this functionality to v1 and to give it a higher priority.

@peterdesmet
Member

Suggested in camtraptor July 2023 coding sprint

  • Call this function merge()
  • Limit it to two packages merge(package, package2) to avoid complex operations, testing and error handling. Users can add additional packages using pipe: package1 %>% merge(package2) %>% merge(package3)
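
As a rough sketch of why a two-package function is enough, the example below chains a hypothetical `merge_two()` over three toy packages. Here a "package" is simply a list holding a deployments table, which is an assumption made for illustration and not the real package structure.

```r
# Hypothetical two-package merge: only the deployments tables are combined.
merge_two <- function(x, y) {
  list(deployments = rbind(x$deployments, y$deployments))
}

# Hypothetical constructor for a toy package.
make_package <- function(ids) {
  list(deployments = data.frame(deploymentID = ids, stringsAsFactors = FALSE))
}

package1 <- make_package(c("a1", "a2"))
package2 <- make_package("b1")
package3 <- make_package("c1")

# Chaining two-way merges by nesting the calls, which is what the pipe does ...
merged_nested <- merge_two(merge_two(package1, package2), package3)

# ... or by folding over a list of packages with base R's Reduce().
merged_reduced <- Reduce(merge_two, list(package1, package2, package3))
```

Both forms give the same result, so supporting only two arguments does not limit how many packages a user can combine.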

@peterdesmet peterdesmet changed the title from "new function : merging packages" to "Create merge() to merge datasets" Jul 15, 2023
@PietrH
Member

PietrH commented Jul 20, 2023

A package that is the result of a merge() #75 should combine its metadata (not sure how)

From #74

@peterdesmet peterdesmet transferred this issue from inbo/camtraptor May 31, 2024
@peterdesmet
Member

I have described the functionality in more detail at tdwg/camtrap-dp#380. I think the function should be called merge_camtrapdp().

@peterdesmet peterdesmet added the function:transform Functions round_coordinates(), merge_camtrapdp(), etc. label May 31, 2024
@peterdesmet peterdesmet added this to the Version 0.x.0 milestone May 31, 2024
@sannegovaert sannegovaert linked a pull request Jul 25, 2024 that will close this issue
@peterdesmet
Member

peterdesmet commented Aug 2, 2024

Task list by @sannegovaert:

  • Deployments: combination of deployments (but check that deploymentID remains unique)
  • Media: combination of media (but check that mediaID remains unique)
  • Observations: combination of observations (but check that observationID remains unique)
  • name: a new name is created (old values are not retained)
  • id: a new ID is created. The original ids are stored in relatedIdentifiers
  • created: reset to current timestamp (old values are not retained)
  • title: a new title (old values are not retained)
  • contributors: a combination is made, duplicates (on some or all fields) are removed and roles are combined (in Frictionless v2). The order is not retained.
  • description: a new one is generated, potentially still listing the previous descriptions
  • version: is reset to 1.0 (old values are not retained)
  • keywords: a combination is made, duplicates are removed
  • image: is removed (old values are not retained)
  • homepage: is removed (old values are not retained)
  • sources: a combination is made, duplicates are removed
  • licenses: a combination is made, duplicates are removed. If e.g. two different licenses with scope: media are listed, it won't be clear which one applies to which media.
  • bibliographicCitation: is removed (old values are not retained)
  • project: ideally this becomes an array of projects and deployments have a projectID to link to the correct project info.
  • coordinatePrecision: is reset to least precise precision (old values are not retained)
  • spatial: is reset based on new deployments (old values are not retained)
  • temporal: is reset based on new deployments (old values are not retained)
  • taxonomic: a combination is made, duplicates are removed.
  • relatedIdentifiers: a combination is made, duplicates are removed
  • references: a combination is made, duplicates are removed
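
A few of the rules in this task list can be sketched in base R as follows. This is only an illustration under assumptions: metadata is represented as plain lists, `check_unique_ids()` is a hypothetical helper, and "least precise" is taken to mean the largest coordinatePrecision value.

```r
# Hypothetical helper: stop if the same ID occurs in both packages.
check_unique_ids <- function(ids_x, ids_y, field) {
  shared_ids <- intersect(ids_x, ids_y)
  if (length(shared_ids) > 0) {
    stop(
      field, " is not unique across packages: ",
      paste(shared_ids, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy metadata for two packages (hypothetical values).
x <- list(keywords = c("mammals", "camera traps"), coordinatePrecision = 0.001)
y <- list(keywords = c("camera traps", "birds"), coordinatePrecision = 0.01)

# keywords: a combination is made, duplicates are removed.
merged_keywords <- union(x$keywords, y$keywords)

# coordinatePrecision: reset to the least precise value (assumed: the largest).
merged_precision <- max(x$coordinatePrecision, y$coordinatePrecision)

# temporal: reset based on the combined deployments (toy start and end dates).
deployment_starts <- as.Date(c("2020-05-01", "2021-03-15"))
deployment_ends <- as.Date(c("2020-09-30", "2021-08-01"))
merged_temporal <- list(
  start = format(min(deployment_starts)),
  end = format(max(deployment_ends))
)
```

The same `union()` pattern would apply to simple character fields only; list-like fields such as contributors, sources or licenses need a comparison on several properties.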

@peterdesmet
Member

Merge should also handle "additional resources". Currently only the additional resources of x are kept.

  • Keeping both makes sense, but if they are named the same, we don't know what to do
  • Even if we keep them, they might contain identifiers that refer to deployments, media or observations, and those references can become invalid

I think that the cleanest solution is therefore to remove additional resources, with a warning that lists which ones are removed, see this code to find them:

extra_resources <- resources[!resources %in% tables]

In the tests, I would remove individuals from x so you don't have to deal with suppressWarnings() all the time.
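
A minimal sketch of the "remove with a warning" approach, built around the one-liner above. The resource names and the `drop_extra_resources()` helper are hypothetical; in practice `resources` would come from the package itself.

```r
# Hypothetical resource names in a package; the three core tables are kept.
resources <- c("deployments", "media", "observations", "individuals")
tables <- c("deployments", "media", "observations")

# Hypothetical helper: drop additional resources and warn about which ones.
drop_extra_resources <- function(resources, tables) {
  extra_resources <- resources[!resources %in% tables]
  if (length(extra_resources) > 0) {
    warning(
      "Additional resources removed: ",
      paste(extra_resources, collapse = ", "),
      call. = FALSE
    )
  }
  resources[resources %in% tables]
}

# suppressWarnings() is only used here to keep the example output quiet.
kept <- suppressWarnings(drop_extra_resources(resources, tables))
```

Removing `individuals` from the test fixture, as suggested above, means the warning branch is not triggered in most tests.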

@PietrH
Member

PietrH commented Oct 17, 2024 via email

@peterdesmet
Member

peterdesmet commented Oct 17, 2024

> Could you prompt the user if they are named the same?

As in, ask for a decision? What options do you provide them?

One alternative is to rename the additional resources by prefixing them with the dataset identifier (cf. the deployments). I would then do this for all resources, so you can see where they came from. This doesn't prompt the user and shouldn't trigger a warning.
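
The prefixing idea could be sketched like this. The helper name, the dataset identifiers and the underscore separator are all assumptions made for illustration.

```r
# Hypothetical helper: prefix resource names with a dataset identifier.
prefix_resource_names <- function(resource_names, dataset_id) {
  paste(dataset_id, resource_names, sep = "_")
}

# Both packages contain an additional resource called "individuals".
extra_x <- prefix_resource_names("individuals", "dataset-x")
extra_y <- prefix_resource_names("individuals", "dataset-y")

# After prefixing, the names no longer clash and show their origin.
combined_names <- c(extra_x, extra_y)
```

This avoids the name clash, but identifiers inside those resources that point to deployments, media or observations would still need the same treatment to stay valid.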

@PietrH
Member

PietrH commented Oct 21, 2024

If they are named the same, ask the user to name the tables.

I think there is value in being consistent with the deployments.
