RFO: Directory and file name structure for gene family collections #192
Comments
Fine with me! I load 'em whatever they're called! Just be sure to rename the phylotree nodes when you rename the families.
I don't have any objections, although I have some ambivalence about including a "strain" slot in the id, since presumably gene families will always be mixed strain, and I think we decided we wouldn't include that in pangene set identifiers for the same reason. I think we argued that there was no compelling reason that identifier schemes for different data types needed to be similar, but I could be misremembering. In any case, I agree we should grandfather the old identifiers in the files; it would be a massive PITA to change them in all of the GFA files and the various places those have been consumed!
@adf-ncgr - Good point about "mixed" in the collection name. Would be nice for it to be something more useful.
Well, I don't know how inspiring this is, but all I really meant to suggest is that we remove that "field" from the ids (not replace it with something besides "mixed"). So, just as we have Arachis.pan2 we'd have legume.fam2 and leave it at that. Maybe we could use an extra bit of yuck for some other purpose here, but it doesn't seem necessary to have it be any fixed length to be considered "yucky enough".
That's inspiration enough for me. I'll go with that.
Adding @svengato here, since this change impacts the Funnotate/phylogram & Lorax. The "problem" I am trying to address with this renaming is that My initial proposal (which I partly "implemented" with a renaming) was e.g., I also take @adf-ncgr's point: "it would be a massive PITA to change them in all of the GFA files and the various places those have been consumed!" So, I think there are two questions here: Whatever we decide on point 2, I am willing to do the renaming (either forward to legume.fam1.M65K, or backward to legume.genefam.fam1.M65K). [edit - since I can't do things right the first time] |
WARNING: expanding to DS in general I think the opposite: genome collections are incompletely identified. The collection Hwangkeum.gnm1.4S83 says nowhere that it is, in fact, a genome. Yes, it has an assembly version, gnm1, but that could be ScoobyDoo123 if there were some reason to preserve that assembly version from the original source. Same with annotation collections. Hwangkeum.gnm1.ann1.1G4F is known to US to be an annotation because it has an extra field with an annotation version and no other collection-defining identifier. But that could also be CruellaDeVille7. We only know it's an annotation because it does NOT have something like .mrk. or .gwas. to indicate that it is something else. So WE know that Wm82.ScoobyDoo123.CruellaDeVille7.ABCD is an annotation collection, but there is no way anyone could discern that if they weren't working for LIS. So I think we have some inconsistency/incompleteness in our collection naming which renders those collections non-Findable, which is the first F in FAIR. (Also, as you know, I think KEY4 is spurious and clutters/complicates our naming. So far I have found no actual purpose for it since the stuff preceding the KEY4 is already unique.) UPDATE: OK, technically they're findable because a URL is a URL. Could be a random alphanumeric string for that matter. But the collection identifiers do not always self-identify what they are, or to which genus and/or species they belong, for that matter. But they include four characters that serve no identifying purpose whatsoever. |
@sammyjava - I think the counterargument regarding incompletely identified Hwangkeum.gnm1.4S83 is that it sits in You might say "Yes, but how about if someone receives a bare-naked I would say: No problem, because the README within the collection, README.Hwangkeum.gnm1.4S83.yml, describes the contents: And my argument again for the utility of 4S83 here: It is a funky string that aids users in Findability and provenance. If someone stumbles across a file |
couple of quick comments (NOT intended to prolong the agony!):
But they don't know where to get the file from. And computer programs shouldn't have to parse the internals of READMEs to determine the provenance of identifiers. Etc. I think you're thinking in terms of human beings looking at files and directories, not automated processes. I write automated processes and I find a lot of difficulty with these issues. Yes, I can drill down to a README because I'm a human being reading it off of a URL in my browser, but that's not what I'm talking about. I'm talking about well-self-identifying identifiers. That's all. From the identifier of a tarball (as you suggest) one should be able to say: "this is a genome assembly for the genus Glycine, species max, accession Williams82, version gnm1" simply from the fields in the identifier. I don't expect to convince you of any of this. But you know how I feel about it. P.S. You'll never convince me that KEY4 is really useful and worthy of a field in our identifiers. So I'll just drop that complaint. |
Yeah, but that's totally inconsistent with the practice for diversity, expression, gwas, maps, markers, pangenomes, qtls, synteny, Why is it OK to use inconsistent naming syntax at LIS? |
Fine. I'm out.
In jest I hope. Seriously, I'd like to try for more consistency -- and in fact, that's the intent of this RFO; but the real balance to be struck is with tradeoffs between semantically opaque UUIDs and semantically meaningful ones - at the cost of the potential for length and messiness of the meaningful identifiers. Things like human names and quirky publication choices give us things like I am also open to large, fundamental changes -- but recognizing that larger changes have larger implementation costs. |
Returning to the spirit of the RFO, I propose to complete the gene family renaming, including all files within tarballs, to: I am open to objections, including reverting to the previous pattern (effectively, adding back the "extra" field); but absent objections, I'll go ahead with the renaming - probably at the end of the week. I am also open to counter RFOs (or proposals/discussions) to deal with the inconsistencies that you've raised, datastore-wide, @sammyjava - but I think those would be better discussed in a separate issue. |
Hmm, I was just trying to note that someone could know gnm1 is a genome if they had its files independent of the context of datastore structure because "gnm" is supposed to be the part that provides the info (the fact that it is followed by version info in the case of genomes and not in other cases like "mrk" is just because at one point we all agreed that different mrk sets aren't versions of one another). I think all of the datatypes are consistent in that the datatype indicated by the containing folder is given some sort of representation in the filenames (e.g. "genomes" -> gnm+version, "markers" -> mrk, "pangenomes" -> pan+version...) @StevenCannon-USDA regarding the "gensp", the initial driver for considering this change was actually legumeinfo/microservices#616 But @sammyjava raised the additional point that it's hard to recognize Hwangkeum.gnm1.4S83 as a soybean genome without having recourse to some additional level of indirection, which is arguably unFAIR at some level. I do think it's recognizable from the id as a genome ("gnm") just not a glyma gnm. The linkout service hasn't had a problem with gnm/ann being this way because it operates off the identifiers of the contents of the files, not the files themselves, and we do inject the gensp there. The linkouts issue could potentially be solved other ways, but it seemed worth revisiting the naming- after all, as far as revisiting naming, you started it! ;) |
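To make the point about type tokens concrete, here is a minimal sketch of reading the data type off an identifier from fields like "gnm", "ann", "mrk" and "pan". The token table is only illustrative: the tokens come from this thread, but the function name `infer_datatype` and the labels for "ann" and "fam" are assumptions, not anything the Data Store or the microservices actually define.

```python
import re
from typing import Optional

# Type tokens mentioned in this thread. The labels follow the examples above
# ("genomes" -> gnm+version, "markers" -> mrk, "pangenomes" -> pan+version);
# the labels for "ann" and "fam" are assumed for illustration.
TYPE_TOKENS = {
    "gnm": "genomes",
    "ann": "annotations",
    "pan": "pangenomes",
    "fam": "gene families",
    "mrk": "markers",
    "gwas": "gwas",
}

# A type field is a lowercase token optionally followed by a version number.
TOKEN_RE = re.compile(r"^([a-z]+)(\d*)$")


def infer_datatype(collection_id: str) -> Optional[str]:
    """Return the type implied by the last recognized type field, or None."""
    found = None
    for field in collection_id.split("."):
        match = TOKEN_RE.match(field)
        if match and match.group(1) in TYPE_TOKENS:
            # Later fields win, so gnm1.ann1 is read as an annotation.
            found = TYPE_TOKENS[match.group(1)]
    return found


print(infer_datatype("Hwangkeum.gnm1.4S83"))       # genomes
print(infer_datatype("Hwangkeum.gnm1.ann1.1G4F"))  # annotations
print(infer_datatype("Wm82.ScoobyDoo123.CruellaDeVille7.ABCD"))  # None
```

The last call is the case raised earlier: if the version fields were arbitrary strings, nothing in the identifier would say what the collection is. Note also that nothing here recovers the genus or species, which is the separate "gensp" concern.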
@adf-ncgr - so is the proposal: |
I would have to defer to @sammyjava as to the exact requirement, since it's not 100% clear to me where he gets the identifier for the QTL/GWAS studies from (i.e., the thing that the trait search tool gets from the mine and will pass along to the linkout service). I think it is actually the identifier attribute given in the README, but for validation's sake that value is also required to match the folder name. And once you change the folder name, adding it to the file names of README (and similar) files seems to follow logically (and makes the naming within folders more consistent). The initial rationale was to help the linkout specification, and that is also my primary rationale. But I have some ambivalence about sweeping changes as well and am totally open to further discussion about the cost/benefit of other approaches.
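The consistency rule described here (README identifier must match the folder name, and the README file name carries the folder name) could be checked along these lines. This is a rough sketch under assumptions: `check_collection` is a hypothetical helper, not the actual validator, and it takes the identifier as an already-extracted string rather than parsing the README itself. The README.<collection>.yml naming is taken from the example earlier in the thread.

```python
from typing import List


def check_collection(folder: str, readme_name: str, identifier: str) -> List[str]:
    """Return a list of naming problems for one collection (empty if consistent)."""
    problems = []
    # The identifier attribute in the README is expected to equal the folder name.
    if identifier != folder:
        problems.append(
            f"identifier {identifier!r} does not match folder {folder!r}"
        )
    # The README file name is expected to embed the folder name.
    expected = f"README.{folder}.yml"
    if readme_name != expected:
        problems.append(f"README is named {readme_name!r}, expected {expected!r}")
    return problems


# Consistent collection: no problems reported.
print(check_collection(
    "Hwangkeum.gnm1.4S83",
    "README.Hwangkeum.gnm1.4S83.yml",
    "Hwangkeum.gnm1.4S83",
))

# Folder renamed, README not updated: two problems reported.
print(check_collection(
    "legume.fam1.M65K",
    "README.legume.genefam.fam1.M65K.yml",
    "legume.genefam.fam1.M65K",
))
```

The second call illustrates why a folder rename tends to drag the README name and identifier along with it.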
no objection from me to the proposed naming change for gene families
While preparing to calculate a new gene family set, I notice that the 2018 gene family collection uses a naming pattern that is (I think) inconsistent with the general pattern throughout the Data Store. Undoubtedly my fault - though the naming scheme may not have been fully gelled as of early 2018.
The pattern that I think we should be following, for consistency, is:
/strain.type.KEY/gensp.strain.type.KEY.filetype.suf
Instead, we had
/legume.genefam.fam1.M65K/
"genefam" is extraneous, and "mixed" is the term we've been using when the "strain" is ... mixed. The label "legume" would be appropriate to use in the "gensp" position in the filenames.
I have provisionally renamed the collection as follows:
(By "provisionally" I really mean: I've made this change; if there are strong objections, then I'll revert.)
I have not fixed the yuck prefixes within the files. In fact, I hope we can treat these as "legacy" and just replace them with new families, which would follow the naming (and prefixing) pattern above. The new families should be ready within about a week (early February).
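For anyone scripting against the directory/file pattern above, here is a minimal sketch of splitting a file name into its parts given the collection (directory) name. The helper `split_filename` is hypothetical, and the file type and suffix in the example ("genome_main", "fna.gz") are made up for illustration; only the overall shape gensp.<collection>.filetype.suf comes from the pattern.

```python
from typing import Dict, Optional


def split_filename(collection: str, filename: str) -> Optional[Dict[str, str]]:
    """Split 'gensp.<collection>.<filetype>.<suf>' into parts, or return None."""
    # The collection name must appear whole, bounded by dots on both sides.
    marker = "." + collection + "."
    gensp, sep, rest = filename.partition(marker)
    # Reject names with no gensp prefix, a multi-field prefix, or nothing after.
    if not sep or not gensp or "." in gensp or not rest:
        return None
    filetype, _, suffix = rest.partition(".")
    return {
        "gensp": gensp,
        "collection": collection,
        "filetype": filetype,
        "suffix": suffix,
    }


# Hypothetical file name; "glyma" is the gensp prefix for soybean.
print(split_filename(
    "Hwangkeum.gnm1.4S83",
    "glyma.Hwangkeum.gnm1.4S83.genome_main.fna.gz",
))

# A file lacking the gensp prefix does not fit the pattern.
print(split_filename(
    "Hwangkeum.gnm1.4S83",
    "Hwangkeum.gnm1.4S83.genome_main.fna.gz",
))
```

Treating the collection name as an opaque unit, rather than counting dot-separated fields, sidesteps the fact that the type portion is one field for genomes (gnm1) and two for annotations (gnm1.ann1).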
@adf-ncgr @sammyjava