`debug_mode` breaks training workflow #2428
Took a quick look and haven't got anywhere, but this is probably useful when we get more time to have a proper look.

Script to help understand debug mode to pinpoint issues:

```python
import logging

import duckdb

from splink import DuckDBAPI
from splink.internals.duckdb.dataframe import DuckDBDataFrame
from splink.internals.pipeline import CTEPipeline

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
db_api.debug_mode = True

# logging.basicConfig(format="%(message)s")
# logging.getLogger("splink").setLevel(1)

sql = """
CREATE TABLE input_data_abc AS
SELECT 'john' AS first_name, 'doe' AS surname
UNION ALL SELECT 'jane', 'smith'
UNION ALL SELECT 'jim', 'johnson'
"""
con.execute(sql)

input_sdf_1 = DuckDBDataFrame(
    templated_name="input_data", physical_name="input_data_abc", db_api=db_api
)

pipeline = CTEPipeline([input_sdf_1])

sql = f"""
select upper(first_name) as first_name from {input_sdf_1.templated_name}
"""
pipeline.enqueue_sql(sql, "input_upper")
# In debug mode, this is run and an output table called 'input_upper'
# is created.

sql = """
select substr(first_name, 1, 2) as first_name_substr from input_upper
"""
pipeline.enqueue_sql(sql, "__splink__output_table")

output_sdf = db_api.sql_pipeline_to_splink_dataframe(pipeline)
output_sdf.as_duckdbpyrelation()
```
IIRC the issue is basically this:
I think the actual issue should be relatively easy to sort out, but I wonder if a wider change might be warranted. Maybe we should change how (or if) caching works when we're in debug mode.
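As a rough sketch of that wider change (toy code with hypothetical names, not Splink's actual cache API): in debug mode the cache could simply be bypassed, so every step re-runs its SQL and a stale intermediate table can never be reused:

```python
class MaterialisationCache:
    """Toy stand-in for an intermediate-table cache (hypothetical API)."""

    def __init__(self, debug_mode=False):
        self.debug_mode = debug_mode
        self._cache = {}  # templated_name -> physical_name

    def get_or_materialise(self, templated_name, materialise_fn):
        # In debug mode, never read from or write to the cache:
        # each lookup re-materialises, so a table created under one
        # configuration (e.g. no salting) can't leak into a later step.
        if not self.debug_mode and templated_name in self._cache:
            return self._cache[templated_name]
        physical_name = materialise_fn()
        if not self.debug_mode:
            self._cache[templated_name] = physical_name
        return physical_name
```

With `debug_mode=False` the second lookup hits the cache; with `debug_mode=True` the materialisation function runs every time.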
One thing (probably separate from that specific issue): I think we should output the intermediate tables with a unique name (e.g. with asci_uuid) and then pass them into the next step. It doesn't feel very robust to physically instantiate tables with very generic names in the db.
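A minimal sketch of that naming scheme (a hypothetical helper, not Splink's real naming logic), using the stdlib `uuid` module to make the physical name collision-resistant while keeping the templated name visible:

```python
import uuid


def unique_physical_name(templated_name: str) -> str:
    """Derive a collision-resistant physical table name from a templated one.

    Hypothetical helper for illustration only. The uuid4 hex suffix keeps
    the result a valid unquoted SQL identifier in most dialects.
    """
    return f"{templated_name}_{uuid.uuid4().hex[:12]}"
```

Each debug-mode step could materialise to `unique_physical_name("input_upper")` and hand that physical name to the next step, instead of writing a literal `input_upper` table into the db.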
If we turn on `debug_mode`, u-training breaks, but only if we have already run `estimate_probability_two_random_records_match`. We get:

```
Binder Error: Table "l" does not have a column named "__splink_salt"
```
This is to do with us materialising a table when salting wasn't required, but then trying to access it when salting is required.
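A stripped-down simulation of that failure mode (toy code, not Splink's internals): a table materialised without the salt column gets reused, and a later salted query then binds against a column that was never created:

```python
def materialise(columns, salting_required):
    """Materialise a toy 'table': just a dict recording its columns."""
    cols = list(columns)
    if salting_required:
        # Salting adds a __splink_salt column at materialisation time only.
        cols.append("__splink_salt")
    return {"columns": cols}


def run_salted_query(table):
    """A salted blocking rule references l.__splink_salt on the table."""
    if "__splink_salt" not in table["columns"]:
        raise ValueError(
            'Binder Error: Table "l" does not have a column'
            ' named "__splink_salt"'
        )
    return "ok"


# First pass (e.g. estimate_probability_two_random_records_match):
# salting wasn't required, so the table is materialised without the column.
cached = materialise(["first_name", "surname"], salting_required=False)

# Later, u-training needs salting but reuses the already-materialised table,
# so run_salted_query(cached) raises the Binder Error above.
```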
I have a weird sense of déjà vu with this - maybe I came across it another time in the last few weeks? But I couldn't find any other issue, so I'm opening one here.