
debug_mode breaks training workflow #2428

Open
ADBond opened this issue Sep 26, 2024 · 3 comments
Labels
bug (Something isn't working), debug_mode

Comments

ADBond (Contributor) commented Sep 26, 2024

If we turn on debug_mode, u-training breaks, but only if we have already run estimate_probability_two_random_records_match. We get: Binder Error: Table "l" does not have a column named "__splink_salt".

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
db_api.debug_mode = True

# Runs fine, and (in debug mode) materialises the concat table unsalted:
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)
# Fails with the Binder Error above, as this step expects a salt column:
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

This is to do with us materialising a table when salting wasn't required, but then trying to access it when salting is required.
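For reference, the failure mode can be reproduced outside Splink with plain DuckDB (a minimal sketch; the table name is illustrative rather than Splink's exact internal name):

import duckdb

con = duckdb.connect()

# Step 1: the "concat" table is materialised with salting off, so it
# has no __splink_salt column
con.execute("CREATE TABLE __splink__df_concat AS SELECT 'john' AS first_name")

# Step 2: a later step assumes salting is on and references the salt
# column, failing with:
# Binder Error: Table "l" does not have a column named "__splink_salt"
con.execute("SELECT l.__splink_salt FROM __splink__df_concat AS l")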

I have a weird sense of déjà vu with this - maybe I came across it another time in the last few weeks? But I couldn't find any other issue, so I'm opening one here.

ADBond added the bug label on Sep 26, 2024
RobinL (Member) commented Sep 30, 2024

Took a quick look and haven't got anywhere, but this is probably useful when we get more time to have a proper look:

Script to help understand debug mode and pinpoint issues
import logging

import duckdb

from splink import DuckDBAPI
from splink.internals.duckdb.dataframe import DuckDBDataFrame
from splink.internals.pipeline import CTEPipeline

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
db_api.debug_mode = True

# logging.basicConfig(format="%(message)s")
# logging.getLogger("splink").setLevel(1)


sql = """
CREATE TABLE input_data_abc AS
SELECT 'john' AS first_name, 'doe' AS surname
UNION ALL SELECT 'jane', 'smith'
UNION ALL SELECT 'jim', 'johnson'
"""

con.execute(sql)


# Wrap the physical table as a Splink dataframe so the pipeline can use it
input_sdf_1 = DuckDBDataFrame(
    templated_name="input_data", physical_name="input_data_abc", db_api=db_api
)


pipeline = CTEPipeline([input_sdf_1])

sql = f"""
select upper(first_name) as first_name from {input_sdf_1.templated_name}
"""
pipeline.enqueue_sql(sql, "input_upper")
# In debug mode, this is run and an output table called 'input_upper'
# is created.


sql = """
select substr(first_name, 1, 2) as first_name_substr from input_upper
"""


pipeline.enqueue_sql(sql, "__splink__output_table")


output_sdf = db_api.sql_pipeline_to_splink_dataframe(pipeline)

output_sdf.as_duckdbpyrelation()

ADBond (Contributor, Author) commented Sep 30, 2024

IIRC the issue is basically this:

  • estimate_probability_two_random_records_match calls _cumulative_comparisons_to_be_scored_from_blocking_rules, which creates the concat table, explicitly with salting off
  • In debug mode this table is materialised (normally it lives only in a CTE, so this doesn't matter)
  • In estimate_u_using_random_sampling we then read this cached table. But we are using DuckDB, so salting is required and we expect a salt column, and we hit an error when we try to reference it.

I think the actual issue should be relatively easy to sort out, but I wonder if a wider change might be warranted - maybe changing how (or if) caching works when we're in debug mode.
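A minimal sketch of the "if" option (helper and names hypothetical, not Splink's actual internals): bypass the cache entirely when debug mode is on, so a table materialised under one salting configuration is never reused under another:

from typing import Callable, Dict

def get_or_compute(
    cache: Dict[str, object],
    key: str,
    compute: Callable[[], object],
    debug_mode: bool,
) -> object:
    # In debug mode, always recompute and never populate the cache, so a
    # stale materialisation (e.g. an unsalted concat table) cannot leak
    # into a later step that expects different columns
    if debug_mode:
        return compute()
    if key not in cache:
        cache[key] = compute()
    return cache[key]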

RobinL (Member) commented Sep 30, 2024

One thing (probably separate from that specific issue): I think we should be outputting the intermediate tables with a unique name (e.g. with an ASCII uuid), and then passing them into the next step. It doesn't feel very robust to physically instantiate tables with very generic names in the db.
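A minimal sketch of that idea (names hypothetical, not Splink's implementation): derive a unique physical name for each materialised step, and pass that name on to the next step rather than a generic one:

import uuid

import duckdb

con = duckdb.connect()

def unique_physical_name(templated_name: str) -> str:
    # e.g. "input_upper" -> "input_upper_9f1c2ab3"
    return f"{templated_name}_{uuid.uuid4().hex[:8]}"

physical_name = unique_physical_name("input_upper")
con.execute(f"CREATE TABLE {physical_name} AS SELECT upper('john') AS first_name")
# The next step would then select from physical_name instead of the
# generic "input_upper", so reruns cannot collide with stale tables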
