Provide a way to download the crawl #1
Comments
What do you mean by downloading the crawl? |
Yeah, sorry @innovationchef, this was unclear. I do mean providing it in a more consumable format than SQL. I don't think CSV is suitable, but maybe JSON with the JSON-LD parts embedded. The motivation for this is that other biological data projects, such as http://fairsharing.org co-ordinated by @Drosophilic, may find it useful, rather than having to perform their own crawl. In other words, a bit like http://commoncrawl.org/, though many, many orders of magnitude smaller and only for the JSON-LD parts of the webpages, not everything. I suspect exporting the Solr db will not be that useful since it shreds the input JSON-LD, though I'm happy to be proved wrong. In that case, the crawler will need access to the SQLite crawl db, and some code to turn that into a JSON file. But also see #5, where we may adopt some other crawler which will have a different intermediate db... You might want to wait on the outcome of that if you don't want to risk a bit of wasted work. |
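To make the "SQLite crawl db to JSON file" idea concrete, here is a minimal sketch of what such an export could look like. The table name `jsonld_docs` and its `url` / `jsonld` columns are assumptions made up for illustration, not the crawler's actual schema, and `export_crawl` is a hypothetical helper.

```python
import json
import sqlite3


def export_crawl(conn, out_path):
    """Write every stored JSON-LD document to a single JSON file.

    Assumes a hypothetical table jsonld_docs(url TEXT, jsonld TEXT);
    the real crawl db may be laid out differently.
    """
    rows = conn.execute("SELECT url, jsonld FROM jsonld_docs ORDER BY url")
    records = []
    for url, raw in rows:
        try:
            # Embed the parsed JSON-LD so consumers get structure, not a string.
            records.append({"url": url, "jsonld": json.loads(raw)})
        except json.JSONDecodeError:
            # Keep unparseable input verbatim rather than silently dropping it.
            records.append({"url": url, "raw": raw})
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
    return len(records)
```

Something like `export_crawl(sqlite3.connect("crawl.db"), "crawl.json")` would then produce the download, where `crawl.db` is again just a placeholder path. Embedding the parsed JSON-LD rather than flattening it keeps the original document shape, which is the reason for preferring JSON over CSV here.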
Hey @justinccdev, I have a few more doubts. You can see the code in my branch. |
Hi @innovationchef. So what's happening with the JSON playground is that it tries to replace the @context with one it fetches from the URL value, trying with the header 'Accept:application/ld+json, application/json'. If you execute
you'll see that schema.org returns a valid JSON-LD context. Unfortunately, http://bioschemas.org currently does not. So using schema.org should work fine for things like DataCatalog and Dataset, since they actually are schema.org types and Bioschemas just places some restrictions on how they can be used in a bio context (e.g. cardinality constraints, mandatory/recommended/optional properties). But PhysicalEntity was a wholly Bioschemas creation, and even then has been superseded by BioChemEntity, so I'm not sure how easy it would be to use the pyld library as you are in your code. Arguably, I shouldn't have stuck "@context":"http://bioschemas.org" in the synbiomine pages, but it seemed like a good idea at the time :). I think these last two points illustrate one of the interesting challenges in writing Buzzbang, namely that although it's dealing with structured JSON-LD, there are going to be all kinds of invalid and inaccurate things in the data, such as here using a context that doesn't exist. One approach would be to ignore such invalid data entirely. But I would prefer to process it wherever possible, which may be a challenge with libraries that are expecting completely valid input. |
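For anyone wanting to run the same check from Python, here is a rough sketch of the content negotiation described above, using only the standard library. The helper names (`build_context_request`, `looks_like_context`, `serves_context`) are invented for this example, and the test for "a valid context" is deliberately crude: it only checks that the body is a JSON object with a top-level `@context` key.

```python
import json
import urllib.request

# Same Accept header the playground is described as sending.
ACCEPT = "application/ld+json, application/json"


def build_context_request(url):
    """Build a GET request that asks the server for a JSON-LD context."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT})


def looks_like_context(body):
    """True if body parses as a JSON object with a top-level "@context"."""
    try:
        doc = json.loads(body)
    except (ValueError, TypeError):
        return False
    return isinstance(doc, dict) and "@context" in doc


def serves_context(url, timeout=10):
    """Fetch url and report whether it appears to serve a JSON-LD context."""
    try:
        with urllib.request.urlopen(build_context_request(url), timeout=timeout) as resp:
            return looks_like_context(resp.read().decode("utf-8", errors="replace"))
    except (OSError, ValueError):
        # Network failures, HTTP errors and malformed URLs all count as "no".
        return False
```

Calling `serves_context` on each distinct `@context` URL found during a crawl would be one way to spot contexts that cannot be dereferenced before handing documents to a stricter library.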
Normally I would wait until a PR to comment on code, but I looked at your repo and here are some points.
|
As you suggested in the first comment, we could either ignore such invalid data entirely or process it wherever possible. If we replace the context http://bioschemas.org with http://schema.org, the JSON-LD can be interpreted. It is just a hacky way of getting it done. What do you say? I will start working on the other suggestions that you mentioned in the last comment, and thank you for your suggestions on using git. I will take care of it from now on. |
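The context swap could be sketched roughly as below. `UNRESOLVABLE_CONTEXTS` and `rewrite_context` are hypothetical names, and the set of broken contexts is an assumption based only on the bioschemas.org case discussed in this thread.

```python
# Contexts assumed (from this thread) not to dereference to a JSON-LD context.
UNRESOLVABLE_CONTEXTS = {"http://bioschemas.org", "https://bioschemas.org"}


def rewrite_context(doc, replacement="http://schema.org"):
    """Return a copy of doc with an unresolvable string @context swapped out.

    Documents whose context is not in the known-broken set are returned
    unchanged. The input dict is never modified in place.
    """
    ctx = doc.get("@context")
    if isinstance(ctx, str) and ctx.rstrip("/") in UNRESOLVABLE_CONTEXTS:
        fixed = dict(doc)
        fixed["@context"] = replacement
        return fixed
    return doc
```

The caveat from the earlier comment still applies: this only helps for types that schema.org itself defines, such as DataCatalog. A Bioschemas-only type like PhysicalEntity has no definition in the schema.org context, so its terms would still not be interpreted meaningfully after the swap.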
Thinking about it further, maybe there is a case (at least for now) to reject invalid data, since that will make things easier whilst trying to improve other parts of the search engine. I have avoided the problem in this particular case by updating beta.synbiomine.org to have "@Schema":"http://schema.org" as of a923986. |
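A reject-for-now policy might look something like the sketch below. What counts as "invalid" is an assumption here: a document is accepted only if it is a JSON object whose `@context` is in an allow-list and which declares an `@type`. `ACCEPTED_CONTEXTS`, `is_acceptable` and `partition` are illustrative names, not part of the crawler.

```python
# Allow-list of contexts assumed to resolve; extend as more are verified.
ACCEPTED_CONTEXTS = {"http://schema.org", "https://schema.org"}


def is_acceptable(doc):
    """True if doc is a dict with an allow-listed @context and an @type."""
    if not isinstance(doc, dict):
        return False
    ctx = doc.get("@context")
    if not isinstance(ctx, str) or ctx.rstrip("/") not in ACCEPTED_CONTEXTS:
        return False
    return "@type" in doc


def partition(docs):
    """Split docs into (accepted, rejected) so rejects can still be logged."""
    accepted, rejected = [], []
    for doc in docs:
        (accepted if is_acceptable(doc) else rejected).append(doc)
    return accepted, rejected
```

Keeping the rejected documents around, rather than discarding them, leaves the door open to processing them later if the stricter policy turns out to be temporary.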
Hey @justinccdev , |
When the project matures further, look at a way to provide the whole crawl as a download, as per Peter McQuilton