
debug_mode breaks training workflow #2428

Open
ADBond opened this issue Sep 26, 2024 · 3 comments
Labels
bug (Something isn't working), debug_mode

Comments

ADBond (Contributor) commented Sep 26, 2024

If we turn on debug_mode, u-training breaks, but only if we have already run estimate_probability_two_random_records_match. We get: Binder Error: Table "l" does not have a column named "__splink_salt".

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
db_api.debug_mode = True

# Runs fine, and (in debug mode) materialises the concat table unsalted:
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)
# Fails with the Binder Error above, as this step expects a salt column:
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

This is to do with us materialising a table when salting wasn't required, but then trying to access it when salting is required.
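For reference, the failure mode can be reproduced outside Splink with plain DuckDB (a minimal sketch; the table name is illustrative rather than Splink's exact internal name):

import duckdb

con = duckdb.connect()

# Step 1: the "concat" table is materialised with salting off, so it
# has no __splink_salt column
con.execute("CREATE TABLE __splink__df_concat AS SELECT 'john' AS first_name")

# Step 2: a later step assumes salting is on and references the salt
# column, failing with:
# Binder Error: Table "l" does not have a column named "__splink_salt"
con.execute("SELECT l.__splink_salt FROM __splink__df_concat AS l")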

I have a weird sense of déjà vu with this - maybe I came across it another time in the last few weeks? But I couldn't find any other issue, so I'm opening one here.

ADBond added the bug label on Sep 26, 2024
RobinL (Member) commented Sep 30, 2024

Took a quick look and haven't got anywhere, but this is probably useful when we get more time to have a proper look:

Script to help understand debug mode and pinpoint issues
import logging

import duckdb

from splink import DuckDBAPI
from splink.internals.duckdb.dataframe import DuckDBDataFrame
from splink.internals.pipeline import CTEPipeline

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
db_api.debug_mode = True

# logging.basicConfig(format="%(message)s")
# logging.getLogger("splink").setLevel(1)


sql = """
CREATE TABLE input_data_abc AS
SELECT 'john' AS first_name, 'doe' AS surname
UNION ALL SELECT 'jane', 'smith'
UNION ALL SELECT 'jim', 'johnson'
"""

con.execute(sql)


# Wrap the physical table as a Splink dataframe so the pipeline can use it
input_sdf_1 = DuckDBDataFrame(
    templated_name="input_data", physical_name="input_data_abc", db_api=db_api
)


pipeline = CTEPipeline([input_sdf_1])

sql = f"""
select upper(first_name) as first_name from {input_sdf_1.templated_name}
"""
pipeline.enqueue_sql(sql, "input_upper")
# In debug mode, this is run and an output table called 'input_upper'
# is created.


sql = """
select substr(first_name, 1, 2) as first_name_substr from input_upper
"""


pipeline.enqueue_sql(sql, "__splink__output_table")


output_sdf = db_api.sql_pipeline_to_splink_dataframe(pipeline)

output_sdf.as_duckdbpyrelation()

ADBond (Contributor, Author) commented Sep 30, 2024

IIRC the issue is basically this:

  • estimate_probability_two_random_records_match calls _cumulative_comparisons_to_be_scored_from_blocking_rules, which creates the concat table, explicitly with salting off
  • In debug mode this table is materialised (normally it lives only in a CTE, so this doesn't matter)
  • In estimate_u_using_random_sampling we then read this cached table. But we are using DuckDB, so salting is required and we expect a salt column, and we hit an error when we try to reference it.

I think the actual issue should be relatively easy to sort out, but I wonder if a wider change might be warranted - maybe changing how (or if) caching works when we're in debug mode.
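A minimal sketch of the "if" option (helper and names hypothetical, not Splink's actual internals): bypass the cache entirely when debug mode is on, so a table materialised under one salting configuration is never reused under another:

from typing import Callable, Dict

def get_or_compute(
    cache: Dict[str, object],
    key: str,
    compute: Callable[[], object],
    debug_mode: bool,
) -> object:
    # In debug mode, always recompute and never populate the cache, so a
    # stale materialisation (e.g. an unsalted concat table) cannot leak
    # into a later step that expects different columns
    if debug_mode:
        return compute()
    if key not in cache:
        cache[key] = compute()
    return cache[key]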

RobinL (Member) commented Sep 30, 2024

One thing (probably separate from that specific issue): I think we should be outputting the intermediate tables with a unique name (e.g. with an ASCII uuid), and then passing them into the next step. It doesn't feel very robust to physically instantiate tables with very generic names in the db.
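A minimal sketch of that idea (names hypothetical, not Splink's implementation): derive a unique physical name for each materialised step, and pass that name on to the next step rather than a generic one:

import uuid

import duckdb

con = duckdb.connect()

def unique_physical_name(templated_name: str) -> str:
    # e.g. "input_upper" -> "input_upper_9f1c2ab3"
    return f"{templated_name}_{uuid.uuid4().hex[:8]}"

physical_name = unique_physical_name("input_upper")
con.execute(f"CREATE TABLE {physical_name} AS SELECT upper('john') AS first_name")
# The next step would then select from physical_name instead of the
# generic "input_upper", so reruns cannot collide with stale tables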
