Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get query's tables and columns #3

Merged
merged 15 commits into from
Feb 20, 2024
Merged

Conversation

tsmacdonald
Copy link
Contributor

@tsmacdonald tsmacdonald commented Feb 15, 2024

The best introduction to this is the javadoc for SqlVisitor.java. I'll also add comments to key places in the source code.

The Javadoc for ASTWalker.java should get you started, and once you read that the Clojure should be fairly self-explanatory.

@tsmacdonald tsmacdonald requested a review from a team February 15, 2024 11:09
@tsmacdonald tsmacdonald self-assigned this Feb 15, 2024
clj -T:build jar
```

This will create a JAR in the `target` directory.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be expanded once I set up Clojars.

build.clj Show resolved Hide resolved

@Override
public void visit(Table table) {
this.sqlVisitor.visitTable(table);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and similar for columns) is one of the key changes from TablesNamesFinder.java, which this is mostly a copy-and-paste of. The original class kept a List of table names that were added to in this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Tim, why did you copy TNF.java?"

Because walking the AST is a lot of legwork and TNF already implemented it nicely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a comments in Github, you could defend the design (and guide future maintainers) with comments.

  1. Have a "This class copies TNF as of . {{reasons}}. See method docstrings for where we have deviated"
  2. Docstrings to mark the surgical spots where you changed stuff.

(swap! column-names conj (.getColumnName column)))
(visitTable [_this _table]))]
(.walk (ASTWalker. column-finder) parsed-query)
@column-names))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while lying in bed this morning I started to fantasize about adding some sort of context+results object as an argument to the visit____ methods so that 1) we didn't have to close over the columns atom 2) we could keep the breadcrumbs of what part of a SQL query the column/table was found in.

That's likely to happen at some point. But for now this way works and I don't want to make things too complicated too early.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That I like that idea, basically doing a fold over the tree, creeping towards catamorphism 👽

(deftest ^:parallel query->columns-test
(testing "Simple queries"
(is (= #{"foo" "bar" "id" "quux_id"}
(columns "select foo, bar from baz inner join quux on quux.id = baz.quux_id")))))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the tech plan to source crazy queries to parse, but for now I just have the one as a sanity check. JSqlParser is doing all the heavy lifting, so this test exists to make sure that we're using JSqlParser correctly. On a separate occasion, we'll do testing to find the holes in JSqlParser's coverage.

Comment on lines 29 to 34
column-finder (reify
SqlVisitor
(^void visitColumn [_this ^Column column]
(swap! column-names conj (.getColumnName column)))
(visitTable [_this _table]))]
(.walk (ASTWalker. column-finder) parsed-query)
Copy link
Collaborator

@crisptrutski crisptrutski Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this interface, I was thinking of having something like:

(query/walk 
  {:column (fn [^Column column]
             (swap! column-names conj (.getColumnName column))}
  parsed-query)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we're going to do accumulate things pretty often, perhaps even have some sugar over that:

(query/gather
  {:column #(.getColumnName ^Column %)}
  parsed-query)

That could return a map of keys to all the non-nil values calculated from the corresponding node visits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly opposed to the sugar, but I'm not mad enough about the desugared approach to change it yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to walk the fine line between "sugar" and "unnecessary abstraction"

Comment on lines 37 to 48
public interface SqlVisitor {

/**
* Called for every `Column` encountered, presumably for side effects.
*/
public void visitColumn(Column column);

/**
* Called for every `Table` encountered, presumably for side effects.
*/
public void visitTable(Table table);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the kind of interface I imagined. I'd prefer to get away from needing to extend or implement anything.

The idea I had was for there to a concrete class, your "copy" of TNF, with the major difference being that it takes a Map<String, Consumer<Visitable>> argument in its constructor.

For each overloading of visit, it would check if the given map has an entry for the corresponding type, and if so call it with its argument before calling the rest of its own traversal code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my bad for misunderstanding.

I was going to start bikeshedding about whether this was all a moot point after the context/results object was added above, but on further reflection I think that's orthogonal.

I think the two approaches are maybe not so different? Here are the pros (in bold) and cons I can see:

SqlVisitor interface Map of fns
Type safety: can't forget or mis-type a visit___ method (was it :columns or :column?)
Avoids the faff of making a helper to turn fns into a Consumer instance. Although I suppose we could use IFn directly
verbose for the client code More Clojure-y interface; using reify is kind of a pain; avoids providing blank implementations for unneeded methods
The gather sugar is nice

I don't think there's a slam-dunk argument in either direction, so my bias is to leave it for now and use a) accumulating other things (we've done it…once!) or b) needing the query context as the opportunity to refactor as wanted?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the table, here are my thoughts:

  1. "Can't forget a visit method" - the list we support may get quite long, painful to be exhaustive.
  2. "Was it :columns or :column" - spec / malli / kondo / assert to the rescue
  3. "Avoids the faff of turning fns to Consumer" - whoops, I thought Clojure magically made this work now. This is a Clojure oriented library so we could definitely use IFn in the map, and maybe even Keyword too, but I think converting to String before passing to the constructor will be more ergonomic.

For me the key thing is how pleasant this is to work with as we start to care about more kinds of leaves and clauses. I'm OK to go with what works and rework as we add more requirements.

Will give this a closer look tomorrow, just did a high level flyby. Also interested to hear others opinions before I commit to a tick 😄

Copy link
Collaborator

@crisptrutski crisptrutski Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this looks quite nice to me

(defn- kv->java [[k f]]
  [(name k)
   (reify java.util.function.Consumer
     (accept [_this item]
       (f item)))])

;; Malli stuff on the args, also forgotten what the visitor interface looked like
(defn walk-query [visitor-map parsed-query]
  (.run (MyFancyVisitor. (into {} (map kv->java) visitor-map)
        parsed-query)

Copy link
Collaborator

@crisptrutski crisptrutski Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, just thought of a few more potential benefits to the "data oriented" approach:

  1. Fusion - we can (merge-with call-both) over all our walkers, and do a single walk for efficiency.
  2. Decoration - easy to wrap functions, e.g. for debugging or logging.
  3. Derivation - maybe we can derive more complete mappings from incomplete maps.
  4. Generalization - perhaps nice register functions for some group of entities in a hierarchy

I just love data, man 🌈

Copy link
Collaborator

@crisptrutski crisptrutski Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I'd heard something! The design and how they reached it is really interesting too.

Unfortunately it won't cover the case of pulling an IFn out of the map in Java, no automatic conversion there. It would make kv->java a one liner though, and lambda conversions look so nice and light compared to reify 😍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! There's some Clojure lib-specific voodoo on the Java side, but that doesn't matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think converting to String before passing to the constructor will be more ergonomic

I thought about this, but then the look of

{:column (fn [column] ...)}

was just too clean to pass up. So you see my Keyword->String shuffling in the ASTWalker constructor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is very much about making the Clojure code look as clean as possible, to the detriment of cute Java.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Was it :columns or :column" - spec / malli / kondo / assert to the rescue

Malli is on the radar, but…not in this PR ;)

@tsmacdonald tsmacdonald changed the title Implement query->columns using the new SqlVisitor interface Get query's tables and columns Feb 19, 2024
Copy link
Collaborator

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only blocking comments are around NPEs and a subtle bug with using transients, but here take this nits as well 😄

If you're working on Macaw and make changes to a Java file, you must:

1. Recompile
2. Restart your Clojure REPL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no way to improve this workflow in 2024? 😢 (outside scope of this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly with the right ns refresh, but I haven't worked it out :(

import net.sf.jsqlparser.statement.upsert.Upsert;

/**
* Walks the AST, using JSqlParser's `visit()` methods. Each `visit()` method additionally calls an applicable callback method provided in the `callbacks` map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Inconsistent line width

Comment on lines 229 to 230
public ASTWalker(Map<Keyword, IFn> callbacksWithKeywordKeys) {
this.callbacks = new HashMap<String, IFn>(SUPPORTED_CALLBACK_KEYS.length);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: One thing I really like about doing the key transformation in here is that we encapsulate a copy of the argument, rather than assuming it's already immutable - which Java's types can't enforce.

Note: that this constructor might not be allocating enough capacity, as it is not considering the load factor. You should instead be using HashMap.newHashMap, if you're on Java 19 or later.

Idea: it might be worth asserting that there are no unknown callback keys in this map, particularly since we don't have Malli or friends providing validation in the Clojure wrapper (yet).

Comment on lines 231 to 232
for(Map.Entry<Keyword, IFn> entry : callbacksWithKeywordKeys.entrySet()) {
this.callbacks.put(entry.getKey().getName(), entry.getValue());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will register functions even if they're not for known callbacks, it may make more sense to loop over the known keys instead. This would also give us the opportunity to register "no-op" functions and simplify where we call the callbacks.

The stickler in me would also prefer if we worked with Consumer rather than IFn, but it's probably not worth doing this until the next Clojure release makes that conversion cheaper.


@Override
public void visit(Table table) {
this.callbacks.get(TABLE_STR).invoke(table);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can NPE currently, but see my previous comment for one way to get around this. Other options would be to use containsKey or getOrDefault with a NOOP Ifn.


Note: 'fully-qualified' means 'as found in the query'; it doesn't extrapolate schema names from other data sources."
(Specifically, it returns their fully-qualified names as strings, where 'fully-qualified' means 'as found in the query'.)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth being super explicit and saying something like "where names are not qualified within the query, they are returned as simple names, without the namespace (table) being inferred."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To handle the general case in future I guess the parser will need to do this inference if we want it, as it may depend on the scope in a nested query. No need to say anything about that, just musing.

(let [column-names (transient #{})
table-names (transient #{})
ast-walker (ASTWalker. {:column (fn [^Column column]
(conj! column-names (.getColumnName column)))
Copy link
Collaborator

@crisptrutski crisptrutski Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately you can't replace an atom with a transient like this:

Note in particular that transients are not designed to be bashed in-place. You must capture and use the return value in the next call.

(Taken from https://clojure.org/reference/transients)

The current code can break if the underlying tree ends up inserting a root above the node being referenced, so it's an evil case where it can still work for small examples.

You could wrap the transient in a volatile, but I also think an atom is fine - given the expected size of the collections and how well the JVM optimizes memory barriers I don't think the performance here is a hotspot. There's even an argument for never optimizing unless you also add a (micro)benchmark.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, this aspect of working with transients is way too easy to miss, I feel even the docs don't do enough to highlight it.

A "must use" type constraint on the return value is a great way to catch these kinds of mistakes, for example Java has an CheckReturnValue annotation for linting, I wonder if there isn't a way we could also lint to catch these kinds of bugs.

@qnkhuat
Copy link

qnkhuat commented Feb 20, 2024

not blocking but I'd like your thoughts on this approach

(require '[methodica.core :as m])
(import (net.sf.jsqlparser.parser
         CCJSqlParserUtil)
        (net.sf.jsqlparser.schema
         Column
         Table)
        (net.sf.jsqlparser.statement.select
         Select
         SelectItem
         PlainSelect))


(m/defmulti ast-walker
  (fn [obj _collector]
    (class obj)))

(m/defmethod ast-walker :default
  [_obj _collector]
  nil)

(m/defmethod ast-walker :around :default
  [obj collector]
  (collector obj)
  (next-method obj collector))

(m/defmethod ast-walker Select
  [obj collector]
  (when (some? (.getSelectItems obj))
    (run! #(ast-walker % collector) (.getSelectItems obj)))
  (when
    (some? (.getFromItem obj))
    (ast-walker (.getFromItem obj) collector)))

(m/defmethod ast-walker SelectItem
  [obj collector]
  (ast-walker (.getExpression obj) collector))

(defn get-table-and-columns
  [query]
  (let [data         (atom {:columns []
                            :tables  []})
        collector    (fn [x]
                       (cond-> data
                         (instance? Table x)
                         (swap! update :tables conj (.getName x))

                         (instance? Column x)
                         (swap! update :columns conj (.getFullyQualifiedName x true))))]
    (ast-walker (CCJSqlParserUtil/parse query) collector)
    @data))

(get-table-and-columns "select id, name from my_table")

;; => {:columns ["id" "name"], :tables ["my_table"]}

this is an "improvised" visitor pattern in Clojure :D, where all the primary implementation just walk the query.
The real side-affect is done in the around method (read more about it here)

(m/defmethod ast-walker :around :default
  [obj collector]
  (collector obj)
  (next-method obj collector))

this ensures collector will be called on all objects that ast-walker walks through.

I think the only cons here is that we have to implement the walking logic, but that's doable, especially we have the TNF class as a reference.

would love to have your thoughts on this :)

@tsmacdonald
Copy link
Contributor Author

@qnkhuat that's effectively what we have in Java (just without elegant Clojure/methodical sugar); the problem is that there are so many classes to deal with. I haven't run your code, but I don't see how it could deal with joins, for example. At some point you'd end up with about 1000 lines of Clojure that were very similar to the current ASTWalker class.

An argument could be made for…doing just that. Translating the code would be tedious but ultimately not take that long. I didn't because I had a preference for adapting a well-tested, library-endorsed Java file rather than DIYing it, and with all the scar tissue I have from Clojure-Java interop I strongly suspect that future concerns will be glad to have the Java escape hatch. For example, getting the line/column information out of the parse tree looks like it's going to be a mess and almost certainly easier on the Java side.

@crisptrutski
Copy link
Collaborator

crisptrutski commented Feb 20, 2024

On the topic of multimethods, your example looks a lot like an early prototype Tim showed me, although we hadn't had the idea of the callback map yet. Semantically it's still very close to this Java based solution, and here are my thoughts on the relative merits.

Java pros:

  1. Java tooling is very good at generating and editing boilerplate, because its intrinsic.
  2. We can easily copy-paste from upstream if AST changes, without needing to reconvert.
  3. Faster - not using dynamic dispatch, and avoiding reflection.

None of these are game changers, but they add up.

Multi-methods pros:

  1. Avoid mixing languages and an inferior REPL workflow.
  2. Powerful meta programming tools, like :around and multiple dispatch, but do we need them?
  3. ... not sure what else I'm missing?

In my opinion the benefits here are less clear, and I think it will overall be more painful to work with. This is the crux of my preference for "just java".

Comment on lines +213 to +215
public static final String COLUMN_STR = "column";
public static final String TABLE_STR = "table";
public static final Set<String> SUPPORTED_CALLBACK_KEYS = Set.of(COLUMN_STR, TABLE_STR);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this for another day, but I realized that using an Enum for these values would also let you use an EnumMap for callbacks, which is much more efficient that a regular map.

Copy link
Collaborator

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tsmacdonald tsmacdonald merged commit 4f8e0ec into master Feb 20, 2024
4 checks passed
@tsmacdonald tsmacdonald deleted the abstract-factory-factory-visitor branch February 20, 2024 16:58
@qnkhuat
Copy link

qnkhuat commented Feb 21, 2024

I guess I just love control from my REPL, and also, by taking a collector, we can de-couple logic to get tables and column names from the Java class; it gives Macaw a higher level of abstraction.

But anyway I think the point about using well-tested code makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants