Clarification on how to get a list of domains associated with fingerprinting #125
Comments
The
For the majority of trackers the relationship between domains and trackers is one-to-one. For others, the domains files show which domains fingerprinting data is sent to, while the trackers view shows a more aggregated picture of what the tracker is doing. For example, Facebook uses
I hope that clears things up a little for you. From your use-case it looks like the
Hi @birdsarah, many thanks for the PR and issues raised.

    class Domains(PandasDataLoader):
        def __init__(self, data_months, region="global"):
            super().__init__(data_months, name="domains", region=region)

then add this to ...

    self.domains = Domains(
        data_months=self.data_months,
        region=region
    )

Now you can consume domains via the

    data = DataSource(region="global")
    domains = data.domains.df

where
Thanks so much for this feedback @sammacbeth @ecnmst. This is extremely helpful.
But if, on reflection, you don't want the update to loader.py feel free to close the issue.
I've been working through the various data options trying to compile a list of domains that classify as fingerprinting. I'm getting mixed results and wondering if you can clarify what you consider the canonical approach.
Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.
I can use the data source and get a list of tracker ids as follows
This gives me 193 trackers. I can then map these to domains using the map from create_tracker_map.
This will give me 326 domains.
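The tracker-to-domain step described above can be sketched roughly as follows. This is only an illustration: it assumes the map is a plain dict from domain to tracker id, which may not match what create_tracker_map actually returns, and the function name, ids and domains are all made up.

```python
def domains_for_trackers(tracker_ids, domain_to_tracker):
    """Return the sorted domains whose tracker id is in tracker_ids.

    domain_to_tracker is a hypothetical dict {domain: tracker_id};
    the real tracker map may be shaped differently.
    """
    wanted = set(tracker_ids)
    return sorted(
        domain
        for domain, tracker in domain_to_tracker.items()
        if tracker in wanted
    )


# Toy data for illustration only, not real results.
example_map = {
    "tracker-a.example": "tracker_a",
    "cdn.tracker-a.example": "tracker_a",
    "tracker-b.example": "tracker_b",
}
fingerprinting_domains = domains_for_trackers(["tracker_a"], example_map)
print(fingerprinting_domains)
```

Because one tracker can own several domains, this route can return more domains than there are tracker ids.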
If I take a different route and read in all the csv files under the assets folders labeled domains.csv, I can get a list of domains like this
But this gives me a list of 292 domains.
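A minimal sketch of this csv route, using only the standard library, might look like the following. The column names host_tld and bad_qs come from the files mentioned in this issue; the function name, the directory layout and the threshold value are assumptions, since the project's actual cut-off is not stated here.

```python
import csv
from pathlib import Path


def domains_over_threshold(assets_root, threshold, column="bad_qs"):
    """Collect host_tld values from every domains.csv under assets_root
    whose `column` value is at or above `threshold`.

    `threshold` is a caller-supplied, hypothetical cut-off.
    """
    found = set()
    for csv_path in Path(assets_root).rglob("domains.csv"):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                try:
                    value = float(row[column])
                except (KeyError, TypeError, ValueError):
                    continue  # skip rows with a missing or non-numeric value
                host = row.get("host_tld")
                if host and value >= threshold:
                    found.add(host)
    return found
```

Returning a set means a domain that appears in several monthly or regional files is only counted once.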
I can think of an explanation for this - not all host_tld's might have a bad_qs that meets the threshold but they've been added to the tracker map for other reasons.
However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.
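One way to check that explanation is to compare the two lists directly and look at which domains appear on only one side. The sketch below uses plain set operations; the function name and the example domains are hypothetical.

```python
def compare_domain_sets(from_tracker_map, from_csv):
    """Split two domain collections into shared and one-sided entries."""
    a, b = set(from_tracker_map), set(from_csv)
    return {
        "only_in_tracker_map": sorted(a - b),
        "only_in_csv": sorted(b - a),
        "in_both": sorted(a & b),
    }


# Toy data for illustration only.
report = compare_domain_sets(
    ["a.example", "b.example", "c.example"],
    ["b.example", "c.example", "d.example"],
)
for label, domains in report.items():
    print(label, len(domains), domains)
```

If the explanation holds, the domains that appear only in the tracker map should be the ones whose bad_qs falls below the threshold.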
Many thanks in advance for your help.