[ADAP-667] [Feature] python model support via livy #821
Comments
That would be amazing. I am interested in contributing to this feature.
@dataders, interesting, I wasn't aware of this development. Some thoughts from a high-level scan of the design docs and the Spark JIRA ticket: it's not clear to me that Spark Connect can remotely execute Python code; rather, it allows remote execution of the Spark DataFrame API via a local Python process talking to a lightweight Spark client. This means, for example, that plain Python such as the re module, numpy, or scikit-learn would not be available.
My understanding of the intent behind dbt Python models is not to support Spark per se, but to enable the execution of Python in a remote runtime. An abstraction supporting only the Spark DataFrame API will, by necessity, only support calls to the Spark DataFrame API, and any calls which are not made to a Spark context will be executed inside the dbt process. That API does support SQL, however, and could be used to submit SQL if required. Note though:
The Spark Connect UDF implementation supports Python UDFs by spinning up Python sidecars inside the process as required, which means that outside of UDFs there doesn't seem to be support for remote Python execution. Apache Livy, on the other hand, is a remote Spark context management tool: for pyspark sessions, a Python driver process is started alongside a Spark JVM process, and that driver is available to execute Python code (see the sketch below). I do wonder whether there are other remote code execution clients for Python that could themselves use Spark. Support for grid computing tools such as Slurm might allow a more generic Python (or other) runtime, which could then launch Spark if required. But to stay consistent with the other platform implementations, Livy is probably the best fit, although the project is a bit dormant. Apache Toree might work too, but it is intended for notebooks.
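For concreteness, here is a minimal sketch of how Livy's REST API exposes that remote Python driver; the endpoint and the submitted snippet are illustrative assumptions, while the REST calls themselves (`/sessions`, `/statements`) are documented Livy API:

```python
# Minimal sketch: executing arbitrary Python remotely via Livy's REST API.
# Assumes a Livy server at http://localhost:8998; the endpoint is illustrative.
import json
import time

import requests

LIVY_URL = "http://localhost:8998"
HEADERS = {"Content-Type": "application/json"}

# 1. Start a remote pyspark session: Livy launches a Python driver process
#    alongside a Spark JVM on the cluster side.
resp = requests.post(f"{LIVY_URL}/sessions",
                     data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
session_url = f"{LIVY_URL}/sessions/{resp.json()['id']}"

# 2. Wait until the session is ready to accept statements.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(1)

# 3. Submit plain Python -- the re module here -- not just DataFrame API calls.
code = 'import re\nprint(re.findall("[0-9]+", "dbt 1.5 livy 0.7"))'
stmt = requests.post(f"{session_url}/statements",
                     data=json.dumps({"code": code}), headers=HEADERS)
stmt_url = f"{session_url}/statements/{stmt.json()['id']}"

# 4. Poll for the result; stdout from the remote driver arrives in "output".
while (result := requests.get(stmt_url, headers=HEADERS).json())["state"] != "available":
    time.sleep(1)
print(result["output"])

# 5. Tear down the remote session.
requests.delete(session_url, headers=HEADERS)
```

Spark Connect, by contrast, would only ship DataFrame operations to the cluster; the `import re` above would execute locally.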
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers. |
Is this your first time submitting a feature request?
Describe the feature
The `method: thrift` method of connecting to spark-sql can never support Python models, because it is designed for JDBC clients like beeline. The proposal is to support dbt Python models using a connection to a Livy server, which is designed to provide a RESTful API around Spark contexts that can be hosted remotely or locally.
An implementation would provide a `method: livy` connection method, which would allow both SQL and Python models to be submitted via Livy, as in the sketch below.
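To illustrate, here is a hedged sketch of how such a method could drive both model types over a single Livy session. The per-statement `kind` field ("sql", "pyspark") is documented Livy 0.5+ behavior; the endpoint, the sample model code, and the dbt wiring are assumptions:

```python
# Hypothetical sketch of what a `method: livy` connection could do: reuse one
# Livy session for both SQL and Python models. Per-statement `kind` is real
# Livy 0.5+ behavior; everything dbt-specific here is assumed.
import json
import time

import requests

LIVY_URL = "http://localhost:8998"  # assumed endpoint taken from the dbt profile
HEADERS = {"Content-Type": "application/json"}

def wait(url: str) -> dict:
    """Poll a Livy session or statement until it is ready."""
    while (obj := requests.get(url, headers=HEADERS).json())["state"] not in ("idle", "available"):
        time.sleep(1)
    return obj

def run_statement(session_url: str, code: str, kind: str) -> dict:
    """Submit one statement of the given kind and return its output."""
    stmt = requests.post(f"{session_url}/statements",
                         data=json.dumps({"code": code, "kind": kind}),
                         headers=HEADERS).json()
    return wait(f"{session_url}/statements/{stmt['id']}")["output"]

# Since Livy 0.5 the session kind can be omitted and set per statement instead.
session = requests.post(f"{LIVY_URL}/sessions",
                        data=json.dumps({}), headers=HEADERS).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"
wait(session_url)

# A compiled SQL model maps onto a "sql" statement...
run_statement(session_url, "select 1 as id", kind="sql")
# ...while a compiled Python model maps onto a "pyspark" statement
# executed in the same remote driver.
run_statement(session_url, "print(spark.range(3).count())", kind="pyspark")
```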
Describe alternatives you've considered
Spark-session support might be a way of implementing this, though it would be quite heavyweight and would suffer from spark-session's existing limitations: the very complex configuration required to cover the dozens of potential Spark deployment scenarios would be difficult to capture in dbt configuration.
I do not know whether HiveServer2 could be adapted to accept pyspark Python submissions, but I doubt it would be easy: it was originally built to support Hive, which is a SQL dialect. Internally, dbt-spark uses `pyhive` to connect to HS2, which suffers from the same limitation.
Who will this benefit?
Anyone using Apache-based or other "on-prem" Spark implementations, e.g. Hadoop or other standalone Spark clusters, or Amazon EMR (which supports Livy), or even as an alternative to the BigQuery implementation, which currently relies on Google's job API.
It also provides a relatively vendor-agnostic interface to other Spark implementations, which would allow support for some of the more obscure cloud platforms with Hadoop offerings.
Are you interested in contributing this feature?
Yes
Anything else?
There is already an implementation of dbt-spark by Cloudera, dbt-spark-livy, which is licensed under the Apache License. It is particularly informative to examine livysession in that repository, which currently only submits `sql` statements. An implementation would extend this capability and merge it into dbt-spark, allowing both SQL and Python models to be executed via Livy/dbt, e.g. models of the shape sketched below.
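For context, this is the general shape of dbt Python model that the extended session would need to execute remotely (standard dbt Python model structure; the model and column names here are invented):

```python
import re  # plain-Python imports run in the remote Livy driver, not in dbt

def model(dbt, session):
    """A dbt Python model; with a Livy backend this executes in the remote driver."""
    dbt.config(materialized="table")
    df = dbt.ref("raw_events")  # hypothetical upstream model, as a Spark DataFrame
    # Arbitrary Python (the re module here) is available because the code runs
    # in the Livy-managed Python driver process rather than inside dbt.
    snake_case = [c for c in df.columns if re.match(r"^[a-z][a-z0-9_]*$", c)]
    return df.select(*snake_case)
```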
It's surprising there doesn't seem to have been a PR to pull this implementation into this repository.