
Provide a way to download the crawl #1

Open
justinccdev opened this issue Oct 20, 2017 · 8 comments
@justinccdev
Member

When the project matures further, look at a way to provide the whole crawl as a download, as suggested by Peter McQuilton.

@innovationchef
Member

What do you mean by downloading the crawl?
Saving crawl.db in a more readable format like CSV? Or exporting the indexed Solr database in XML/JSON format?

@justinccdev
Member Author

Yeah, sorry @innovationchef, this was unclear. I do mean providing it in a more consumable format than SQL. I don't think CSV is suitable, but maybe JSON with the JSON-LD parts embedded. The motivation is that other biological data projects, such as http://fairsharing.org (co-ordinated by @Drosophilic), may find it useful, rather than having to perform their own crawl. In other words, a bit like http://commoncrawl.org/, though many, many orders of magnitude smaller and covering only the JSON-LD parts of the webpages, not everything.

I suspect exporting the Solr db will not be that useful since it shreds the input JSON-LD, though I'm happy to be proved wrong. In that case, the crawler will need access to the Sqlite crawl db, plus some code to turn that into a JSON file. But also see #5, where we may adopt some other crawler with a different intermediate db... you might want to wait on the outcome of that if you don't want to risk a bit of wasted work.

@innovationchef
Member

innovationchef commented Feb 19, 2018

Hey @justinccdev ,
I tried to implement the crawl download and have come up with this interface for it -
./bsbang-extract.py <path_to_db> --save-crawl
This will save the crawl in the path_to_db directory as "db_name.json". I am attaching a sample output of my crawl data from http://fairsharing.org. Please have a look and let me know if it is what you are expecting.
download.txt
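
A minimal sketch, assuming argparse and with the output-path logic guessed from the description above (this is not the actual code in the branch), of how the proposed --save-crawl switch could look:

    import argparse
    import os

    parser = argparse.ArgumentParser(description='Extract JSON-LD from a crawl database.')
    parser.add_argument('path_to_db', help='Path to the Sqlite crawl database')
    parser.add_argument('--save-crawl', action='store_true',
                        help='Also save the crawl as <db_name>.json next to the database')
    args = parser.parse_args()

    if args.save_crawl:
        db_dir = os.path.dirname(os.path.abspath(args.path_to_db))
        db_name = os.path.splitext(os.path.basename(args.path_to_db))[0]
        out_path = os.path.join(db_dir, db_name + '.json')
        # ...dump the extracted JSON-LD to out_path here...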

I have a few more doubts -
I tried to analyze the JSON-LD crawl that we did on the URLs we collected in crawl.db, using https://json-ld.org/, and I observed that for http://fairsharing.org it works fine as depicted here, but the same thing done for http://beta.synbiomine.org/synbiomine/portal.do?class=Gene&externalids=b3930 causes a context URL error. You can see that here. The invalid URL being referred to in the error is this one - http://bioschemas.org. The code that I prepared for downloading the crawl throws an error when it reaches this link.

You can see the code in my branch.
I will send a PR after we have come to a conclusion on my problem.

@justinccdev
Member Author

justinccdev commented Feb 20, 2018

Hi @innovationchef. What's happening with the JSON-LD playground is that it tries to replace the @context with one it fetches from the URL value, using the header 'Accept: application/ld+json, application/json'. If you execute

curl -L -H 'accept:application/ld+json' http://schema.org

you'll see that schema.org returns a valid JSON-LD context. Unfortunately, http://bioschemas.org currently does not.

So using schema.org should work fine for things like DataCatalog and DataSet since they actually are schema.org types and Bioschemas just places some restrictions on how they can be used in a bio context (e.g. cardinality constraints, mandatory/recommended/optional properties). But PhysicalEntity was a wholly Bioschemas creation, and even then has been superseded by BioChemEntity, so I'm not sure how easy it would be to use the pyld library as you are in your code. Arguably, I shouldn't have stuck "@context":"http://bioschemas.org" in the synbiomine pages but it seemed like a good idea at the time :).
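
For illustration, a small example (using pyld's jsonld module with made-up sample values) of why the context URL matters here: expansion works when @context points at schema.org, which serves a JSON-LD context, and fails for bioschemas.org, which currently does not.

    from pyld import jsonld

    doc = {
        "@context": "http://schema.org",   # swap in "http://bioschemas.org" to see the failure
        "@type": "DataCatalog",
        "name": "FAIRsharing",
        "url": "https://fairsharing.org",
    }

    try:
        print(jsonld.expand(doc))          # pyld fetches the remote @context here
    except jsonld.JsonLdError as err:
        # Roughly the failure mode the JSON-LD playground reports for bioschemas.org
        print("Could not resolve the remote context:", err)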

I think that these latter 2 points illustrate one of the interesting challenges in writing Buzzbang, namely that although it's dealing with structured JSON-LD, there are going to be all kinds of invalid and inaccurate things in the data, such as here using a context that doesn't exist. One approach would be to ignore such invalid data entirely. But I would prefer to process it wherever possible, which may be a challenge with libraries that are expecting completely valid input.

@justinccdev
Member Author

Normally I would wait until a PR to comment on code, but I looked at your repo and here are some points.

  • I like your use of pyld to export the JSON (that said, it may not be a viable approach as I wrote above). In any case, we really need to use this library to process the extracted JSON before indexing to Solr, instead of my rubbish 'scan the top properties' approach. Even at the simplest level, this would take care of issues like JSON-LD that uses value nodes instead of specifying values directly. In fact, I'm sufficiently embarrassed now that I may make that change before looking at porting everything to scrapy/frontera, though if you want to look at this too, that would be welcome (though it might be overlapping effort).
  • Do you think it would be possible to use pyld.compact() rather than expand() to dump? I prefer the more compact representation.
  • I would rather that this dump facility be a completely separate Python program (perhaps bsbang-dump.py?) that processes the sqlite database (see the sketch after this list). This is so that it's possible to re-dump without also having to re-extract at the same time, among other things. That said, all this may have to change with a scrapy/frontera port anyway.
  • When you submit a PR, could you squash all the tiny commits into one or a few commits to make it easier to review? I know the existing repo has lots of tiny commits - my excuse is that I was only coding solo at the time and not doing PRs :)
  • If applicable, could you give the commits more detailed names than 'added dowbload crawl function'? Maybe say which switch was added, for instance.
  • Please be careful with spelling and typos. Thanks!
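
To make the third point above concrete, here is a rough sketch of such a separate dump program; the table and column names, and the use of pyld compaction, are assumptions rather than the project's actual schema or code.

    #!/usr/bin/env python3
    # Rough sketch of a standalone dump tool (hypothetical bsbang-dump.py).
    # Assumes the extraction db has a "jsonld" table whose "jsonld" column
    # holds one JSON-LD object per row; the real schema may differ.
    import argparse
    import json
    import sqlite3

    from pyld import jsonld

    parser = argparse.ArgumentParser(description='Dump extracted JSON-LD to a single JSON file.')
    parser.add_argument('db', help='Path to the Sqlite extraction database')
    parser.add_argument('out', help='Path of the JSON file to write')
    args = parser.parse_args()

    conn = sqlite3.connect(args.db)
    docs = []
    for (raw,) in conn.execute('SELECT jsonld FROM jsonld'):
        doc = json.loads(raw)
        # compact() keeps the output small; expand() would spell out every IRI.
        docs.append(jsonld.compact(doc, doc.get('@context', 'http://schema.org')))
    conn.close()

    with open(args.out, 'w') as f:
        json.dump(docs, f, indent=2)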

@innovationchef
Member

As you suggested in the earlier comment, we could either ignore such invalid data entirely or process it wherever possible. If we replace the context http://bioschemas.org with http://schema.org, the JSON-LD can be interpreted. It is just a hacky way of getting it done, though. What do you say?
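
A hedged sketch of that workaround (the function name is made up, and it assumes the documents are plain Python dicts):

    def patch_context(doc):
        # Stop-gap: swap the unresolvable bioschemas.org context for schema.org
        # so that pyld can fetch a real JSON-LD context.
        if doc.get('@context') == 'http://bioschemas.org':
            doc['@context'] = 'http://schema.org'
        return doc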

I will start working on the other suggestions that you mentioned in your last comment, and thank you for your suggestions on using Git. I will take care of that from now on.

@justinccdev
Member Author

Thinking about it further, maybe there is a case (at least for now) for rejecting invalid data, since that will make things easier whilst trying to improve other parts of the search engine.

I have avoided the problem in this particular case by updating beta.synbiomine.org to use "@context":"http://schema.org" as of a923986.

@innovationchef
Member

Hey @justinccdev ,
I think I made a mistake by using pyld to expand or compact the JSON-LD. The current extraction is already compact (this is output generated from your bsbang-extract.py) -
{'@context': 'http://schema.org', 'identifier': 'b3176', 'url': 'http://beta.synbiomine.org/synbiomine/report.do?id=2019002', 'additionalType': 'http://www.ebi.ac.uk/ols/ontologies/so/terms?obo_id=SO:0000704', '@type': 'PhysicalEntity', 'name': 'Gene glmM E. coli str. K-12 substr. MG1655 b3176'}
So now the dumping process becomes as easy as reading the jsonld table from the database and simply dumping it into a new file. I am sending the PR for approval.
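
For reference, a tiny sketch of that read-and-dump approach (table, column and file names are assumptions, not the PR code):

    import json
    import sqlite3

    conn = sqlite3.connect('crawl.db')                      # path assumed
    rows = [json.loads(raw) for (raw,) in conn.execute('SELECT jsonld FROM jsonld')]
    conn.close()

    with open('crawl-dump.json', 'w') as f:                 # output name assumed
        json.dump(rows, f, indent=2)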
