
Notes for potential GSOC students

  • Please read this wiki and the various links to get a feel for this project.
  • Please set up the crawler and the frontend. The crawler only fetches a very small number of pages (though the beta.synbiomine.org target is temporarily broken, so you may have to comment it out until I get it fixed shortly).
  • This is a project that will eventually crawl Bioschemas markup embedded in InterMine (only with dev instances in the GSOC timeframe), but it is a separate set of projects from InterMine itself. It's also intended that Buzzbang will crawl many other life sciences websites embedding Bioschemas information.
  • There are currently very few life sciences websites embedding Bioschemas data, though this is subject to change. Within the GSOC timeframe, the likely crawl target will be the EBI's biosamples website. This is much larger than any crawl target to date, so scalability will need careful thought. It is not necessary that a GSOC project completes a crawl of this site, only that it is capable of doing so in a reasonable timeframe given reasonable server and network resources. That said, the results of an ongoing crawl must be searchable.
  • We can assume that pages on the EBI site can be found via its sitemap.xml. Therefore, in the GSOC timeframe there is no strict need for a more sophisticated link-following crawl, or one that renders pages in a headless web browser before extracting the JSON-LD (see the sketch after this list). However, a proposed design would do well to anticipate those possibilities later on.
  • Bioschemas and this project are actively evolving, so expect change! Other InterMine GSOC projects may be more stable. On the one hand, that makes this project more complicated to work with; on the other, Buzzbang lets you contribute to a developing area where there are currently few competing projects.
  • Please feel free to ask me (justincc AT intermine.org) any questions.
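
For orientation, here is a minimal sketch of the kind of sitemap-driven crawl described above: read the page URLs out of a sitemap.xml, fetch each page, and pull out any embedded JSON-LD. This is not the project's actual crawler code, and the sitemap URL is a placeholder.

```python
# Minimal sketch of a sitemap-driven crawl that extracts embedded JSON-LD.
# Not the real bsbang-crawler code; the sitemap URL below is a placeholder.
import json
import xml.etree.ElementTree as ET

import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://www.example.org/sitemap.xml"  # placeholder target
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def get_sitemap_urls(sitemap_url):
    """Return the page URLs listed in a sitemap.xml."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]


def extract_jsonld(page_url):
    """Return all JSON-LD blocks embedded in a page, parsed into Python objects."""
    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string))
        except (TypeError, json.JSONDecodeError):
            pass  # ignore empty or malformed blocks
    return blocks


if __name__ == "__main__":
    for url in get_sitemap_urls(SITEMAP_URL):
        for block in extract_jsonld(url):
            print(url, json.dumps(block)[:120])
```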

Presentation

I (justincc) recently gave a presentation on Buzzbang at the Bioschemas Samples hackathon. At some point the relevant architectural material will be transferred into this wiki.

State of play

schema.org is a community project to develop a set of schemas that can be embedded in webpages in formats such as JSON-LD, RDFa and Microdata. Example schemas include Movie, Store and Product. Among other use cases, this embedded data can be crawled by search engines such as Google and Yandex and used to return structured results for queries (such as the information boxes you see on some Google search results).
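
As a concrete illustration (written in Python for consistency with the other sketches on this page), this is roughly what embedded schema.org markup looks like when expressed as JSON-LD. The property values here are purely illustrative.

```python
import json

# Purely illustrative schema.org markup expressed as JSON-LD.
# On a real webpage, this JSON would sit inside
# <script type="application/ld+json"> ... </script>.
movie = {
    "@context": "http://schema.org",
    "@type": "Movie",
    "name": "An Example Film",
    "datePublished": "2017-01-01",
    "director": {"@type": "Person", "name": "A. Director"},
}

print(json.dumps(movie, indent=2))
```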

Bioschemas is a community project by the life sciences community to specify how schemas from schema.org can be used to mark up life sciences information. As such, it has two aspects:

  1. When using existing schema.org schemas, such as DataCatalog and Dataset, Bioschemas will specify which properties are mandatory, which are optional, and their cardinality, since schema.org itself specifies none of these things (see the sketch after this list).
  2. In some cases, Bioschemas will come up with new schemas, such as BioChemEntity to describe biological and chemical entities, where nothing suitable pre-exists in schema.org. Once these have gone through review by the Bioschemas community, they will also be suggested to the main schema.org community.
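
To make aspect 1 concrete, the sketch below checks an extracted JSON-LD item against a Bioschemas-style profile. The profile contents are invented for illustration; consult the Bioschemas specifications for the real mandatory/optional properties and cardinalities.

```python
# Sketch of checking an extracted JSON-LD item against a Bioschemas-style
# profile. The profile below is hypothetical, not the real Dataset profile.
HYPOTHETICAL_DATASET_PROFILE = {
    "mandatory": ["name", "description", "url"],
    "optional": ["keywords", "creator"],
    "max_cardinality": {"name": 1, "url": 1},  # unlimited properties omitted
}


def check_against_profile(item, profile):
    """Return a list of human-readable problems; an empty list means the item passes."""
    problems = []
    for prop in profile["mandatory"]:
        if prop not in item:
            problems.append(f"missing mandatory property '{prop}'")
    for prop, limit in profile["max_cardinality"].items():
        values = item.get(prop)
        if isinstance(values, list) and len(values) > limit:
            problems.append(f"'{prop}' has {len(values)} values, max is {limit}")
    return problems


example = {"@type": "Dataset", "name": "An example dataset", "url": "https://example.org/ds/1"}
print(check_against_profile(example, HYPOTHETICAL_DATASET_PROFILE))
# -> ["missing mandatory property 'description'"]
```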

Bioschemas is an extremely young project. As such, the specifications are subject to considerable change and some are not final (in particular BioChemEntity). In addition, very few life sciences information sources have yet implemented this markup. Nonetheless, bsbang-crawler is an alpha project to start crawling this data so that it can be searched via the companion frontend project and, later on, possibly joined together using embedded ontology terms to form a knowledge graph.

As an alpha project, bsbang-crawler is itself subject to considerable change. Until now the crawler has been custom written, but this is a poor choice for future scalability and maintainability, so whilst there might be a little more work done on the custom crawler code, we are actively looking at a transition to an established crawler package.
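
Purely for illustration, if an established package such as Scrapy were chosen (no decision has been made yet), a sitemap-driven JSON-LD spider might look roughly like this; the sitemap URL is a placeholder.

```python
# Hypothetical sketch only: the project has not yet chosen a crawler package.
# Shows roughly what a sitemap-driven JSON-LD spider could look like in Scrapy.
import json

from scrapy.spiders import SitemapSpider


class BioschemasSpider(SitemapSpider):
    name = "bioschemas"
    sitemap_urls = ["https://www.example.org/sitemap.xml"]  # placeholder

    def parse(self, response):
        # Pull every embedded JSON-LD block out of the fetched page.
        for raw in response.xpath('//script[@type="application/ld+json"]/text()').getall():
            try:
                yield {"url": response.url, "jsonld": json.loads(raw)}
            except json.JSONDecodeError:
                continue  # skip malformed blocks
```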

There is also a companion bsbang-frontend project, which is a very simple Google-like search engine on top of the extracted data.

Limitations

Buzzbang, certainly at this stage, will only crawl JSON-LD markup; no other page scraping will be performed, so if something isn't in the markup it won't be extracted. This is a deliberate design choice to keep the crawler simple and to encourage sites to publish Bioschemas markup, which is the whole point of Bioschemas.

Life sciences websites with Bioschemas markup

Some sites have initial markup, chiefly those listed in https://github.com/justinccdev/bsbang-crawler/blob/dev/conf/default-targets.txt. However, this is very limited in scope. The most promising set of sites for initial markup are samples databases. As shown on the Bioschemas front page, there is an event on 15-16 March 2018 to introduce the Bioschemas Sample schema to biobanks, to gather their feedback and to help them implement markup on their sites if they choose to do so. This may result in a large amount of useful real-world markup for Buzzbang to crawl.

Related work

References
