AvroCalciteRest
When developing data services (REST, etc.) you often feel like you are starting to implement a SQL engine (once you need decent filtering, projections, joins...). When you reach that point, it is probably time to start using one.
In data processing, it seems, all roads eventually lead back to SQL.
Let's start with an example model:
```
record Character {
  string characterId;
  string name = "";
  string homeWorldPlanet;
  string speciesName;
}

record Friendship {
  string characterId1;
  string characterId2;
}

record Planet {
  string name;
  string planetClass;
  long age;
  /** deprecated field example */
  string @deprecated("use something else instead") description;
}

record Species {
  string name;
  double averageLifeSpanYears;
  string originPlanet;
}
```
Next, we implement REST endpoints for them (characters, friendships, planets, species).
To make anything like the above queryable, all we need to do is add the query resource and integrate whatever we want to query by implementing the AvroDataSetContract.
You will be able to query all available schemas (characters, friendships, ...) like:
- Entire resource: select * from characters
- Projection: select name from characters
- Pagination (limit, offset): select * from characters order by name, characterId limit 2 offset 2
- Filtering: select name from characters where speciesName='vulcan'
- Aggregations: select originPlanet, count(*) as nrSpecies, sum(averageLifeSpanYears)/count(*) as avgLifeSpanYears from species group by originPlanet
- Join: select c.name from characters c, species s where c.speciesName = s.name and s.originPlanet = 'earth'
- More Join: select name, ARRAY(select c2.name from friendships f, characters c2 where f.characterId1 = c.characterId and f.characterId2 = c2.characterId) as friends from characters c
As you can see with the "More Join" example, as of now (Nov 9 2019) we are hitting CALCITE-3488.
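Since these are plain SQL statements, the features above compose freely in a single request. A small illustrative query over the example model that projects, filters, orders and paginates at once:

```sql
-- projection + filtering + ordering + pagination in one request
select name, homeWorldPlanet
from characters
where speciesName = 'vulcan'
order by name
limit 10 offset 0
```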
And since we are using a query engine, you can take a look at the execution plan of a query. If that is not enough, you can also check out the profiling information.
You can also find out the schema of the result-set.
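One way to look at a plan, assuming the query resource accepts standard Calcite SQL (the service may also expose a dedicated plan endpoint), is the EXPLAIN PLAN FOR prefix; for example, for the join query above:

```sql
-- returns the relational plan of the query instead of its rows
explain plan for
select c.name
from characters c, species s
where c.speciesName = s.name and s.originPlanet = 'earth'
```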
The source code for this example is available, and the running example is hosted on GKE.
Some observations:
- Apache Calcite adds 5 MB to your app. Not too bad for a decent query engine.
- SQL gives you the power to get exactly the data you need in one request, similar to Facebook's GraphQL. As such it might help in higher-latency use cases.
- With Avro you have the option of binary encoding, which gives you superior wire efficiency compared to JSON-based solutions.
- Predicate and projection pushdown are supported and can improve your wire efficiency and performance.
- Compatible schema evolution is something you can enforce at compile time with the spf4j maven avro schema plugin, building and publishing schemas as demonstrated here. Deprecations are communicated to the client via HTTP Warning headers (see the response headers).
- The Calcite plan interpreter still has some non-trivial gaps, so expect to encounter issues (e.g., LAG and LEAD are not supported yet; see the sketch right after this list). On the positive side, the Apache Calcite community seems responsive, which is always a good sign, and you can fix whatever you need for your use case. There might also be plan interpreters out there that are more complete...
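As an illustration of such a gap, a window-function query like the sketch below (over the Planet records from the example model; the alias is illustrative) relies on LAG and is not expected to run on the plan interpreter yet:

```sql
-- LAG/LEAD window functions are among the interpreter gaps mentioned above
select name,
       age,
       lag(age) over (order by age) as previousAge
from planets
```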
Things to further explore:
- mutations: insert, update, delete, merge.
- more statistics and optimizations as things get more complex.
- keys, referential integrity.