-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get query's tables and columns #3
Conversation
clj -T:build jar | ||
``` | ||
|
||
This will create a JAR in the `target` directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be expanded once I set up Clojars.
|
||
@Override | ||
public void visit(Table table) { | ||
this.sqlVisitor.visitTable(table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and similar for columns) is one of the key changes from TablesNamesFinder.java
, which this is mostly a copy-and-paste of. The original class kept a List
of table names that were added to in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Tim, why did you copy TNF.java?"
Because walking the AST is a lot of legwork and TNF already implemented it nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a comments in Github, you could defend the design (and guide future maintainers) with comments.
- Have a "This class copies TNF as of . {{reasons}}. See method docstrings for where we have deviated"
- Docstrings to mark the surgical spots where you changed stuff.
src/macaw/core.clj
Outdated
(swap! column-names conj (.getColumnName column))) | ||
(visitTable [_this _table]))] | ||
(.walk (ASTWalker. column-finder) parsed-query) | ||
@column-names)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while lying in bed this morning I started to fantasize about adding some sort of context+results object as an argument to the visit____ methods so that 1) we didn't have to close over the columns
atom 2) we could keep the breadcrumbs of what part of a SQL query the column/table was found in.
That's likely to happen at some point. But for now this way works and I don't want to make things too complicated too early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That I like that idea, basically doing a fold over the tree, creeping towards catamorphism 👽
(deftest ^:parallel query->columns-test | ||
(testing "Simple queries" | ||
(is (= #{"foo" "bar" "id" "quux_id"} | ||
(columns "select foo, bar from baz inner join quux on quux.id = baz.quux_id"))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's part of the tech plan to source crazy queries to parse, but for now I just have the one as a sanity check. JSqlParser is doing all the heavy lifting, so this test exists to make sure that we're using JSqlParser correctly. On a separate occasion, we'll do testing to find the holes in JSqlParser's coverage.
src/macaw/core.clj
Outdated
column-finder (reify | ||
SqlVisitor | ||
(^void visitColumn [_this ^Column column] | ||
(swap! column-names conj (.getColumnName column))) | ||
(visitTable [_this _table]))] | ||
(.walk (ASTWalker. column-finder) parsed-query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this interface, I was thinking of having something like:
(query/walk
{:column (fn [^Column column]
(swap! column-names conj (.getColumnName column))}
parsed-query)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we're going to do accumulate things pretty often, perhaps even have some sugar over that:
(query/gather
{:column #(.getColumnName ^Column %)}
parsed-query)
That could return a map of keys to all the non-nil values calculated from the corresponding node visits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly opposed to the sugar, but I'm not mad enough about the desugared approach to change it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to walk the fine line between "sugar" and "unnecessary abstraction"
java/com/metabase/SqlVisitor.java
Outdated
public interface SqlVisitor { | ||
|
||
/** | ||
* Called for every `Column` encountered, presumably for side effects. | ||
*/ | ||
public void visitColumn(Column column); | ||
|
||
/** | ||
* Called for every `Table` encountered, presumably for side effects. | ||
*/ | ||
public void visitTable(Table table); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the kind of interface I imagined. I'd prefer to get away from needing to extend or implement anything.
The idea I had was for there to a concrete class, your "copy" of TNF, with the major difference being that it takes a Map<String, Consumer<Visitable>>
argument in its constructor.
For each overloading of visit, it would check if the given map has an entry for the corresponding type, and if so call it with its argument before calling the rest of its own traversal code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my bad for misunderstanding.
I was going to start bikeshedding about whether this was all a moot point after the context/results object was added above, but on further reflection I think that's orthogonal.
I think the two approaches are maybe not so different? Here are the pros (in bold) and cons I can see:
SqlVisitor interface |
Map of fns |
---|---|
Type safety: can't forget or mis-type a visit___ method |
(was it :columns or :column ?) |
Avoids the faff of making a helper to turn fn s into a Consumer instance. Although I suppose we could use IFn directly |
|
verbose for the client code | More Clojure-y interface; using reify is kind of a pain; avoids providing blank implementations for unneeded methods |
The gather sugar is nice |
I don't think there's a slam-dunk argument in either direction, so my bias is to leave it for now and use a) accumulating other things (we've done it…once!) or b) needing the query context as the opportunity to refactor as wanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the table, here are my thoughts:
- "Can't forget a visit method" - the list we support may get quite long, painful to be exhaustive.
- "Was it :columns or :column" - spec / malli / kondo / assert to the rescue
- "Avoids the faff of turning fns to Consumer" - whoops, I thought Clojure magically made this work now. This is a Clojure oriented library so we could definitely use
IFn
in the map, and maybe evenKeyword
too, but I think converting to String before passing to the constructor will be more ergonomic.
For me the key thing is how pleasant this is to work with as we start to care about more kinds of leaves and clauses. I'm OK to go with what works and rework as we add more requirements.
Will give this a closer look tomorrow, just did a high level flyby. Also interested to hear others opinions before I commit to a tick 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this looks quite nice to me
(defn- kv->java [[k f]]
[(name k)
(reify java.util.function.Consumer
(accept [_this item]
(f item)))])
;; Malli stuff on the args, also forgotten what the visitor interface looked like
(defn walk-query [visitor-map parsed-query]
(.run (MyFancyVisitor. (into {} (map kv->java) visitor-map)
parsed-query)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, just thought of a few more potential benefits to the "data oriented" approach:
- Fusion - we can
(merge-with call-both)
over all our walkers, and do a single walk for efficiency. - Decoration - easy to wrap functions, e.g. for debugging or logging.
- Derivation - maybe we can derive more complete mappings from incomplete maps.
- Generalization - perhaps nice register functions for some group of entities in a hierarchy
I just love data, man 🌈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I'd heard something! The design and how they reached it is really interesting too.
Unfortunately it won't cover the case of pulling an IFn out of the map in Java, no automatic conversion there. It would make kv->java
a one liner though, and lambda conversions look so nice and light compared to reify 😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated! There's some Clojure lib-specific voodoo on the Java side, but that doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think converting to String before passing to the constructor will be more ergonomic
I thought about this, but then the look of
{:column (fn [column] ...)}
was just too clean to pass up. So you see my Keyword->String shuffling in the ASTWalker constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is very much about making the Clojure code look as clean as possible, to the detriment of cute Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Was it :columns or :column" - spec / malli / kondo / assert to the rescue
Malli is on the radar, but…not in this PR ;)
a4b6fed
to
7e7cbeb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only blocking comments are around NPEs and a subtle bug with using transients, but here take this nits as well 😄
If you're working on Macaw and make changes to a Java file, you must: | ||
|
||
1. Recompile | ||
2. Restart your Clojure REPL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no way to improve this workflow in 2024? 😢 (outside scope of this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly with the right ns refresh, but I haven't worked it out :(
import net.sf.jsqlparser.statement.upsert.Upsert; | ||
|
||
/** | ||
* Walks the AST, using JSqlParser's `visit()` methods. Each `visit()` method additionally calls an applicable callback method provided in the `callbacks` map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Inconsistent line width
public ASTWalker(Map<Keyword, IFn> callbacksWithKeywordKeys) { | ||
this.callbacks = new HashMap<String, IFn>(SUPPORTED_CALLBACK_KEYS.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: One thing I really like about doing the key transformation in here is that we encapsulate a copy of the argument, rather than assuming it's already immutable - which Java's types can't enforce.
Note: that this constructor might not be allocating enough capacity, as it is not considering the load factor. You should instead be using HashMap.newHashMap
, if you're on Java 19 or later.
Idea: it might be worth asserting that there are no unknown callback keys in this map, particularly since we don't have Malli or friends providing validation in the Clojure wrapper (yet).
for(Map.Entry<Keyword, IFn> entry : callbacksWithKeywordKeys.entrySet()) { | ||
this.callbacks.put(entry.getKey().getName(), entry.getValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will register functions even if they're not for known callbacks, it may make more sense to loop over the known keys instead. This would also give us the opportunity to register "no-op" functions and simplify where we call the callbacks.
The stickler in me would also prefer if we worked with Consumer
rather than IFn
, but it's probably not worth doing this until the next Clojure release makes that conversion cheaper.
|
||
@Override | ||
public void visit(Table table) { | ||
this.callbacks.get(TABLE_STR).invoke(table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can NPE currently, but see my previous comment for one way to get around this. Other options would be to use containsKey
or getOrDefault
with a NOOP
Ifn.
src/macaw/core.clj
Outdated
|
||
Note: 'fully-qualified' means 'as found in the query'; it doesn't extrapolate schema names from other data sources." | ||
(Specifically, it returns their fully-qualified names as strings, where 'fully-qualified' means 'as found in the query'.)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's worth being super explicit and saying something like "where names are not qualified within the query, they are returned as simple names, without the namespace (table) being inferred."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To handle the general case in future I guess the parser will need to do this inference if we want it, as it may depend on the scope in a nested query. No need to say anything about that, just musing.
src/macaw/core.clj
Outdated
(let [column-names (transient #{}) | ||
table-names (transient #{}) | ||
ast-walker (ASTWalker. {:column (fn [^Column column] | ||
(conj! column-names (.getColumnName column))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately you can't replace an atom with a transient like this:
Note in particular that transients are not designed to be bashed in-place. You must capture and use the return value in the next call.
(Taken from https://clojure.org/reference/transients)
The current code can break if the underlying tree ends up inserting a root above the node being referenced, so it's an evil case where it can still work for small examples.
You could wrap the transient in a volatile, but I also think an atom is fine - given the expected size of the collections and how well the JVM optimizes memory barriers I don't think the performance here is a hotspot. There's even an argument for never optimizing unless you also add a (micro)benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, this aspect of working with transients is way too easy to miss, I feel even the docs don't do enough to highlight it.
A "must use" type constraint on the return value is a great way to catch these kinds of mistakes, for example Java has an CheckReturnValue
annotation for linting, I wonder if there isn't a way we could also lint to catch these kinds of bugs.
not blocking but I'd like your thoughts on this approach (require '[methodica.core :as m])
(import (net.sf.jsqlparser.parser
CCJSqlParserUtil)
(net.sf.jsqlparser.schema
Column
Table)
(net.sf.jsqlparser.statement.select
Select
SelectItem
PlainSelect))
(m/defmulti ast-walker
(fn [obj _collector]
(class obj)))
(m/defmethod ast-walker :default
[_obj _collector]
nil)
(m/defmethod ast-walker :around :default
[obj collector]
(collector obj)
(next-method obj collector))
(m/defmethod ast-walker Select
[obj collector]
(when (some? (.getSelectItems obj))
(run! #(ast-walker % collector) (.getSelectItems obj)))
(when
(some? (.getFromItem obj))
(ast-walker (.getFromItem obj) collector)))
(m/defmethod ast-walker SelectItem
[obj collector]
(ast-walker (.getExpression obj) collector))
(defn get-table-and-columns
[query]
(let [data (atom {:columns []
:tables []})
collector (fn [x]
(cond-> data
(instance? Table x)
(swap! update :tables conj (.getName x))
(instance? Column x)
(swap! update :columns conj (.getFullyQualifiedName x true))))]
(ast-walker (CCJSqlParserUtil/parse query) collector)
@data))
(get-table-and-columns "select id, name from my_table")
;; => {:columns ["id" "name"], :tables ["my_table"]} this is an "improvised" visitor pattern in Clojure :D, where all the primary implementation just walk the query. (m/defmethod ast-walker :around :default
[obj collector]
(collector obj)
(next-method obj collector)) this ensures I think the only cons here is that we have to implement the walking logic, but that's doable, especially we have the TNF class as a reference. would love to have your thoughts on this :) |
@qnkhuat that's effectively what we have in Java (just without elegant Clojure/methodical sugar); the problem is that there are so many classes to deal with. I haven't run your code, but I don't see how it could deal with joins, for example. At some point you'd end up with about 1000 lines of Clojure that were very similar to the current ASTWalker class. An argument could be made for…doing just that. Translating the code would be tedious but ultimately not take that long. I didn't because I had a preference for adapting a well-tested, library-endorsed Java file rather than DIYing it, and with all the scar tissue I have from Clojure-Java interop I strongly suspect that future concerns will be glad to have the Java escape hatch. For example, getting the line/column information out of the parse tree looks like it's going to be a mess and almost certainly easier on the Java side. |
On the topic of multimethods, your example looks a lot like an early prototype Tim showed me, although we hadn't had the idea of the callback map yet. Semantically it's still very close to this Java based solution, and here are my thoughts on the relative merits. Java pros:
None of these are game changers, but they add up. Multi-methods pros:
In my opinion the benefits here are less clear, and I think it will overall be more painful to work with. This is the crux of my preference for "just java". |
public static final String COLUMN_STR = "column"; | ||
public static final String TABLE_STR = "table"; | ||
public static final Set<String> SUPPORTED_CALLBACK_KEYS = Set.of(COLUMN_STR, TABLE_STR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this for another day, but I realized that using an Enum for these values would also let you use an EnumMap
for callbacks
, which is much more efficient that a regular map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I guess I just love control from my REPL, and also, by taking a But anyway I think the point about using well-tested code makes sense. |
The best introduction to this is the javadoc forSqlVisitor.java
. I'll also add comments to key places in the source code.The Javadoc for ASTWalker.java should get you started, and once you read that the Clojure should be fairly self-explanatory.