Corona Overview
This tutorial introduces you to the Corona APIs. It's designed to be a quick read, with frequent links to Corona wiki pages for more details. If you do nothing but read this tutorial you'll get an understanding of what Corona can do; if you want to actually do those things you should follow the links.
To start with, Corona assumes three job roles for individuals:
-
The developer. This person does the day-to-day programming against the Corona endpoints. They're a pro with Java, .NET, Ruby, or some other language, and the Corona documentation is the only exposure they have to MarkLogic.
-
The developer admin. This person controls Corona's administrative settings. For example, they adjust current query settings, any stored transformations which may be called, and index settings. They do this via Corona endpoints separate from those available to the regular developer. They do not access MarkLogic's administrative port 8001.
-
The database admin. This person installs MarkLogic, and uses port 8001 to manage forests, system uptime, and get Corona started. They're the classic IT database administrator, often not a programmer.
This tutorial assumes you're a developer or developer admin.
Let's talk about the basic CRUD operations first. Corona lets you store XML and JSON documents using insert, update, and delete endpoints. The caller can assign to each document:
-
A URI - a unique name, possibly in a directory hierarchy
-
Permissions - security rules for what roles can view and modify the document; users and roles are managed by the database administrator
-
Properties - key-value metadata assignments for the document
-
Collections - named grouping (tagging) for documents
-
A Quality - An integer representing the intrinsic value of a document in a search.
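To make the list above concrete, here is a minimal Python sketch that assembles an insert request carrying all five attributes, without sending anything. The host and port, the `/store` path, the `PUT` method, and every parameter name (`uri`, `collection`, `property`, `permission`, `quality`) are assumptions made for illustration, not confirmed Corona names; follow the wiki links for the real ones.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the real host, port, and path depend on your install.
BASE = "http://localhost:8123/store"

def build_insert_request(uri, document, collections=(), properties=None,
                         permissions=None, quality=None):
    """Return (method, url, body) for a document insert. Nothing is sent."""
    params = [("uri", uri)]
    # Repeatable parameters: one entry per collection, property, and permission.
    params += [("collection", c) for c in collections]
    params += [("property", f"{k}:{v}") for k, v in (properties or {}).items()]
    params += [("permission", f"{role}:{capability}")
               for role, capability in (permissions or [])]
    if quality is not None:
        params.append(("quality", str(quality)))
    return "PUT", f"{BASE}?{urlencode(params)}", json.dumps(document)

method, url, body = build_insert_request(
    "/mail/2011/msg-1.json",
    {"subject": "Hello World"},
    collections=["inbox"],
    properties={"reviewed": "no"},
    permissions=[("editor", "update")],
    quality=5,
)
print(method, url)
```

The function only builds the request, so you can inspect the URL before handing it to whatever HTTP client you prefer.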
There are a few things Corona does not yet support but likely will over time:
-
Storing binaries or text documents
-
Allowing content modification other than full document replacement
-
Transactional bulk insertions, or a single transaction extending between multiple invocations
Now, if you know MarkLogic you know it doesn't natively parse JSON. So what's going on? Corona invisibly serializes the JSON to and from an internal XML representation that's optimized for query and remains hidden from the outside world. For all intents and purposes it looks like Corona is inserting and returning the JSON directly.
This is a good example of how Corona provides a "managed context" for the documents in its database. That provides many benefits, but it also means all access should go through Corona. Any direct access or modification made to the documents without Corona may produce undesirable results.
Documents can be retrieved by name using the retrieval endpoint, or selected as part of a query (which we'll cover later). For by-name access, you retrieve a document simply by specifying its URI. The document's metadata can be included in the response upon request.
Sometimes you want the full document back and sometimes you just want a piece. To specify a piece you can provide an extra parameter on the retrieval call. For XML documents this parameter is a simplified XPath expression; for JSON documents it's a JSON path. This can greatly cut down on wire transmission overhead, especially for larger documents.
Sometimes you may want a document transformed as part of the retrieval into a new representation, such as an HTML rendering. For this you specify another parameter on the retrieval call indicating the name of an XSLT stylesheet that should process the document. It only works for XML documents, at least today.
Now, you specify the XSLT by name, not as code. Why? For safety. The library of available XSLT stylesheets must be set up ahead of time by a "developer admin", using a separate and secured content transformers management endpoint. This prevents regular developers from invoking arbitrary code.
(Note that doing both a subselection and a transformation is possible; the subselection will occur first.)
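The retrieval options described above (metadata on request, subselection, and a named transformation) could be combined on one call roughly as follows. This is a sketch only: the `/store` path and the parameter names `include`, `extractPath`, and `applyTransform` are hypothetical stand-ins, and the stylesheet name `to-html` is made up.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real host, port, and path depend on your install.
BASE = "http://localhost:8123/store"

def build_retrieval_url(uri, include_metadata=False, extract_path=None,
                        transformer=None):
    """Build a by-name retrieval URL. All parameter names are assumptions."""
    params = {"uri": uri}
    if include_metadata:
        params["include"] = "all"              # ask for metadata too
    if extract_path:
        params["extractPath"] = extract_path   # subselection runs first
    if transformer:
        params["applyTransform"] = transformer # stylesheet name, never code
    return f"{BASE}?{urlencode(params)}"

print(build_retrieval_url("/articles/a.xml",
                          extract_path="/article/title",
                          transformer="to-html"))
```

Note that the transformer is passed as a plain name, which mirrors the safety rule above: the caller can pick from the library but cannot supply a stylesheet of their own.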
JSON provides very few native datatypes: just objects, arrays, numbers, strings, booleans, and nulls. Corona extends these datatypes using a casting technique to support date and XML datatypes in JSON.
The way casting works is you modify the key name to include a type suffix. For example:
{"message::date": "Fri Jul 08 2011 14:08:18 GMT-0700 (PDT)"}
{"message::xml": "<div><h1 class='subject'>Hello World</h1><div class='body'>XML inside JSON!!</div></div>"}
The value is always a string, but thanks to the casting suffix it will be seen as either a date or XML node. It supports a huge variety of date formats, and auto-senses them during parsing. The XML nodes are fully searchable as if they were regular XML documents.
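To show the idea behind the casting suffix, here is a small Python sketch that splits a key on `::` and converts the string value. It is an illustration of the technique, not Corona's parser: it recognizes only the single date layout used in the example above, whereas Corona auto-senses many formats, and the function name `cast_value` is invented for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

def cast_value(key, value):
    """Split 'name::type' and convert the string value to that type."""
    name, sep, type_suffix = key.partition("::")
    if not sep:
        return name, value  # no suffix: leave the value as a plain string
    if type_suffix == "date":
        # Drop a trailing zone name such as "(PDT)"; the numeric offset stays.
        cleaned = re.sub(r"\s*\([A-Z]+\)$", "", value)
        return name, datetime.strptime(cleaned, "%a %b %d %Y %H:%M:%S GMT%z")
    if type_suffix == "xml":
        return name, ET.fromstring(value)
    raise ValueError(f"unsupported cast: {type_suffix}")

name, when = cast_value("message::date",
                        "Fri Jul 08 2011 14:08:18 GMT-0700 (PDT)")
_, node = cast_value(
    "message::xml",
    "<div><h1 class='subject'>Hello World</h1>"
    "<div class='body'>XML inside JSON!!</div></div>",
)
print(name, when.isoformat(), node.find("h1").text)
```

Once cast, the date can be compared or sorted as a real timestamp, and the XML value can be navigated as a tree instead of being treated as an opaque string.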
Although it's not yet implemented, this same technique will eventually be used to indicate language codes for textual content.
Now it's time to talk about search. Corona includes extremely robust support for search-style queries. Results can be sorted by relevance or (soon) by a scalar. Results can be paged (viewing a small number at a time), and returned documents can be snippeted (showing a short content blurb containing the matching terms) and highlighted (perhaps to bold the matching word occurrences).
When executing a search you can choose to retrieve just a simple description of the matching documents, or to fetch the documents themselves as well, which avoids repeated calls. If you fetch the documents as part of the search, you can request the same subselection and transformation that are available for single-document retrievals.
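If paging, snippeting, and highlighting are new to you, the toy Python functions below show what each one does to a result list or a piece of text. They are a client-side illustration only; Corona performs these steps on the server, and the function names and defaults here are invented for the sketch.

```python
import re

def page(results, start=1, length=10):
    """Return `length` results beginning at the 1-based position `start`."""
    return results[start - 1:start - 1 + length]

def snippet(text, term, radius=20):
    """Return the first match plus up to `radius` characters on each side."""
    hit = text.lower().find(term.lower())
    if hit < 0:
        return ""
    return text[max(0, hit - radius):hit + len(term) + radius]

def highlight(text, term, open_tag="<b>", close_tag="</b>"):
    """Wrap every case-insensitive occurrence of `term` in the given tags."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)

text = "The quick brown fox jumps over the lazy dog"
print(highlight(snippet(text, "fox", radius=6), "fox"))
```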
Corona includes three endpoints for issuing search queries:
The structured query endpoint is a programmer-friendly way to specify a query as a set of hierarchical query constraints expressed using a JSON encoding. The structured query syntax is highly expressive and supports the full set of MarkLogic index features: free-text search, text containment within a location, text equality to a location, strict value equality at a location, scalar-based range constraints, property-based search, collection-based search, directory-based search, and geospatial search, or any boolean hierarchical mix of these.
The string query endpoint is a user-friendly way to specify a query as a specially marked-up string similar to those used by Google. This is something a developer could pass directly from the user interface text box to the Corona back-end for execution. The string query syntax supports many, but not all, of the features of the structured query service.
The key-value query endpoint is syntactic sugar for executing a quick retrieval based on a key that's equal to a certain value.
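Here is a rough Python sketch of what calling the three kinds of query might look like from client code. Treat every name as an assumption: the paths `/search` and `/kvquery`, the parameters `structuredQuery`, `stringQuery`, `key`, `value`, `start`, and `length`, and especially the shape of the structured query JSON are invented to illustrate the idea of a boolean hierarchy of constraints. The real syntax is on the linked wiki pages.

```python
import json
from urllib.parse import urlencode

BASE = "http://localhost:8123"  # hypothetical host and port

def structured_query_url(query, start=1, length=10):
    """Encode a hierarchical query (a dict) as JSON in a URL parameter."""
    params = {
        "structuredQuery": json.dumps(query, separators=(",", ":")),
        "start": start,
        "length": length,
    }
    return f"{BASE}/search?{urlencode(params)}"

def string_query_url(text, start=1, length=10):
    """Pass the user's search-box text through untouched."""
    params = {"stringQuery": text, "start": start, "length": length}
    return f"{BASE}/search?{urlencode(params)}"

def key_value_url(key, value):
    """Quick lookup of documents where `key` equals `value`."""
    return f"{BASE}/kvquery?{urlencode({'key': key, 'value': value})}"

# Hypothetical structured query: subject contains "corona" AND 10 <= price <= 20.
query = {"and": [
    {"key": "subject", "contains": "corona"},
    {"range": "price", "from": 10, "to": 20},
]}

print(structured_query_url(query))
print(string_query_url("hello world"))
print(key_value_url("author", "jason"))
```

The point of the contrast: the structured form is built by a program as nested data, while the string form is whatever the end user typed.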
The search services work well enough for basic purposes out of the box, but for more useful custom functionality it's often necessary for a "developer admin" to configure some aspects of the Corona environment. For example:
The namespaces endpoint configures the set of namespaces recognized by the Corona system. Namespaces of course only matter for XML documents. By specifying a set of namespaces it lets you use a namespace short prefix any time you reference an element or attribute in that namespace, instead of including the long namespace URI. For example, by assigning the "fpml" prefix to the URI "http://www.fpml.org/2005/FpML-4.2" you can refer to elements in that namespace as "fpml:effectiveDate".
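The prefix mapping itself is simple to picture. The Python sketch below resolves a prefixed name against a table of registered namespaces; the table stands in for whatever the namespaces endpoint has stored, and `expand_name` is a name invented for this illustration.

```python
# Stand-in for the prefix-to-URI table a developer admin has registered.
NAMESPACES = {"fpml": "http://www.fpml.org/2005/FpML-4.2"}

def expand_name(qname, namespaces=NAMESPACES):
    """Turn 'prefix:local' into (namespace URI, local name)."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        return None, qname  # no prefix means no namespace
    if prefix not in namespaces:
        raise KeyError(f"unregistered namespace prefix: {prefix}")
    return namespaces[prefix], local

print(expand_name("fpml:effectiveDate"))
```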
The places endpoint creates names for a set of locations in a document, either JSON keys or XML nodes, where a query constraint should apply. For example, RSS has a variety of formats. A single place called "title" could be created that combines "rdf:title" (RSS 0.9), "title" (RSS 0.91 through 2.0, no namespace), and "atom:title" (Atom 1.0).
So what good is defining a place? In structured queries you can use the place name as a more convenient alternative to enumerating a long list of JSON keys or XML nodes specifying where a particular match has to occur.
In string queries the place name automatically becomes a field prefix, available to the user. A user can type title:"all the king's men" and Corona will understand that the phrase has to appear in one of the locations specified by the place "title".
When defining a place you can assign relevance weights to each specified location. This is an important feature for maintaining high-quality relevance-sorted results. Not all locations should be seen as equally relevant. A hit in the title is probably more relevant than a hit in the abstract, which is more relevant than a hit in the body text. A place can group all those locations and weightings into one.
There's also a special place, the place without a name, which controls the behavior of searches that aren't field constrained. In other words, it controls where a query for "foo" looks for matches. This is very important because the majority of users won't type fielded constraints.
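The Python toy below pulls the last few paragraphs together: a named place that groups several locations, a field prefix in a string query, per-location weights, and the unnamed place used for unfielded terms. It is a deliberately naive model. The weights are made up, documents are plain dicts keyed by location, and real relevance scoring in MarkLogic involves far more than adding weights.

```python
# Place name -> list of (location, weight). "" is the place without a name.
PLACES = {
    "title": [("rdf:title", 1.0), ("title", 1.0), ("atom:title", 1.0)],
    "": [("title", 4.0), ("abstract", 2.0), ("body", 1.0)],
}

def parse_term(term):
    """Split 'place:"phrase"' into (place, phrase); unfielded terms use ""."""
    prefix, sep, rest = term.partition(":")
    if sep and prefix in PLACES:
        return prefix, rest.strip('"')
    return "", term.strip('"')

def score(document, term):
    """Add up the weight of every location in the place that contains the phrase."""
    place, phrase = parse_term(term)
    needle = phrase.lower()
    return sum(weight for location, weight in PLACES[place]
               if needle in document.get(location, "").lower())

doc = {
    "title": "All the King's Men",
    "body": "A novel by Robert Penn Warren about all the king's men.",
}
print(score(doc, 'title:"all the king\'s men"'))  # fielded: title locations only
print(score(doc, "king"))                         # unfielded: the unnamed place
```

Because the unfielded search weights a title hit four times as heavily as a body hit, a document that mentions the term in its title would sort ahead of one that only mentions it in passing.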
The range endpoint manages the sortable values in your data set. If you specify a document location as a range, it creates an index in the background and enables:
-
Fast range queries on that scalar (i.e. limiting to values between X and Y)
-
(Soon) Optimized sorting of results by that scalar (e.g. sort by date)
-
Fast extraction of the scalar's values (for use in facets, discussed later).
Ranges are given names just like places, and those names are used in structured queries and string queries, as well as facets (discussed later).
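Why does a background index make these three things fast? The Python sketch below gives the intuition with a sorted list and binary search. This is only an analogy for how a range index behaves, not a description of MarkLogic's implementation, and the `RangeIndex` class is invented for the example.

```python
from bisect import bisect_left, bisect_right

class RangeIndex:
    """Toy range index: scalar values kept sorted, each paired with its URI."""

    def __init__(self, values_by_uri):
        self._entries = sorted((value, uri) for uri, value in values_by_uri.items())
        self._values = [value for value, _ in self._entries]

    def between(self, low, high):
        """URIs whose value lies in [low, high], found by binary search."""
        lo = bisect_left(self._values, low)
        hi = bisect_right(self._values, high)
        return [uri for _, uri in self._entries[lo:hi]]

    def sorted_uris(self):
        """URIs ordered by value; no sorting needed at query time."""
        return [uri for _, uri in self._entries]

    def distinct_values(self):
        """The distinct values, in order (the raw material for facets)."""
        return sorted(set(self._values))

prices = RangeIndex({"/a": 5, "/b": 12, "/c": 20, "/d": 12})
print(prices.between(10, 20))
print(prices.distinct_values())
```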
The bucketed range endpoint is similar to the range endpoint except it's designed to operate against ranges of values instead of individual values. It lets you "bucket" days into months, or "bucket" prices into groups. With this endpoint you define the start and end points of each bucket and give them names, or let the system do smart auto-bucketing.
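As a sketch of explicit bucketing, the Python snippet below maps prices onto named buckets defined by their boundaries. The boundaries and labels are made up, and the snippet covers only hand-defined buckets, not the auto-bucketing mentioned above.

```python
from bisect import bisect_right

# Hypothetical bucket definition: each bound starts a new bucket.
BOUNDS = [10, 50, 100]
LABELS = ["under 10", "10 up to 50", "50 up to 100", "100 and up"]

def bucket_for(price):
    """Return the name of the bucket that `price` falls into."""
    return LABELS[bisect_right(BOUNDS, price)]

for price in (9.99, 10, 75, 100):
    print(price, "->", bucket_for(price))
```

Each lower bound is inclusive here, so a price of exactly 10 lands in "10 up to 50". Whether Corona treats boundaries the same way is something to check on the wiki page.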
While search queries return documents, a facet query returns summaries and statistics. The facet query endpoint accepts a range name or bucketed range name. It then returns all the distinct values (or bucket names), as well as the frequency count for each, sorted either by value or frequency.
By specifying an optional query (it supports either a structured query or string query) you can limit the facet values to documents matching the query.
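In spirit, a facet query is a count of distinct values over the documents that match. The Python sketch below does this in memory with a `Counter`; Corona answers the same question from its range indexes instead of scanning documents, and the `facet` function and its arguments are invented for the illustration.

```python
from collections import Counter

def facet(documents, key, matches=None, order="frequency"):
    """Count distinct values of `key`, optionally limited by a query predicate."""
    selected = [d for d in documents if matches is None or matches(d)]
    counts = Counter(d[key] for d in selected if key in d)
    if order == "frequency":
        # Most frequent first; ties broken by value for a stable result.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(counts.items())  # ordered by value

docs = [
    {"list": "dev", "year": 2010},
    {"list": "dev", "year": 2011},
    {"list": "users", "year": 2011},
]
print(facet(docs, "list"))
print(facet(docs, "list", matches=lambda d: d["year"] == 2011, order="value"))
```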
It's a fairly simple idea, but it's tremendously powerful and enables accurate analytics against documents without pre-defining your dimensions. Any ad hoc query will do for limiting the results. This technique is how MarkMail.org produces the facets on the left-hand side of each search result.
OK, it's time for the last endpoint. If you want to see all the server's settings, including what namespaces, places, ranges, and other settings have been assigned, as well as document counts, you can do that with the server status endpoint. It also reports various information about the server.
Corona is very much a work in progress. Its APIs are subject to change. It's not a coincidence that this tutorial is hosted in an easy-to-change wiki. We're at the stage now where we're looking for people's feedback, so please explore, and let us know what you think. You can file issues on GitHub with your bug reports or RFE ideas. You can also message "hunterhacker" on GitHub or email Jason dot Hunter at MarkLogic dot com.