Clarification on how to get a list of domains associated with fingerprinting #125

birdsarah · 2018-10-20T07:48:03Z

I've been working through the various data options to compile a list of domains that classify as fingerprinting. I'm getting mixed results and wondering if you can clarify what you consider the canonical approach.

Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.

I can use the data source and get a list of tracker ids as follows:

from whotracksme.data.loader import DataSource

fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    trackers_df = who_tracks_data.trackers.df
    # keep trackers whose bad_qs exceeds the 0.1 threshold
    who_tracks_fp = trackers_df[trackers_df.bad_qs > 0.1]
    fp_trackers.update(who_tracks_fp.tracker.values)

This gives me 193 trackers. I can then map these to domains using the map from create_tracker_map:

# tracker_info is the map produced by create_tracker_map
could_not_find = []  # trackers with no entry in the map
domains = set()
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)

This gives me 326 domains.

If I take a different route and read in all the CSV files named domains.csv under the assets folders, I can get a list of domains like this:

import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
# unique host_tld values that exceed the 0.1 threshold in any file
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()

But this gives me a list of 292 domains.

I can think of an explanation for this: not every host_tld has a bad_qs that meets the threshold, but some may have been added to the tracker map for other reasons.
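
One way to check this explanation would be to diff the two domain sets directly. This is only a minimal sketch: the set names and the domain names below are made-up placeholders standing in for the two lists built above, not real results.

```python
# Placeholder data: stand-ins for the two lists built above (not real results).
domains_via_tracker_map = {"example-tracker.com", "example-cdn.net", "other.io"}
domains_via_domains_csv = {"example-tracker.com", "other.io", "unmapped.org"}

# Domains reached through the tracker map that do not pass the threshold on their own.
only_in_tracker_map = domains_via_tracker_map - domains_via_domains_csv

# Domains that pass the threshold but were not reached through the tracker map.
only_in_domains_csv = domains_via_domains_csv - domains_via_tracker_map

print(sorted(only_in_tracker_map))
print(sorted(only_in_domains_csv))
```

If the explanation holds, the first set should contain domains that belong to a flagged tracker but have a low bad_qs themselves.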

However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.

Many thanks in advance for your help.

sammacbeth · 2018-10-22T08:46:46Z

The domains.csv and trackers.csv files represent different aggregations of the same data. If we consider the fingerprinting case:

domains.csv counts the proportion of times when each hostname (at TLD+1 level) was seen sending a fingerprint (or suspected fingerprint) in a third-party context on a page.
trackers.csv counts the proportion for any of the hostnames associated with a tracker - from the mapping in the tracker database.

For the majority of trackers the relationship between domains and trackers is one-to-one. For others the domains files will show to which domains fingerprinting data is sent, while the trackers view shows a more aggregated picture of what the tracker is doing.

For example, Facebook uses facebook.net as a CDN, and we can see from the stats little evidence of tracking on this domain. The tracking requests are aimed at facebook.com, where they have the user's login cookie. In the tracker view we report both domains together, which gives an aggregate picture of Facebook's third-party traffic.
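
To make the difference between the two views concrete, here is a rough sketch with made-up counts. It assumes the tracker-level proportion can be approximated by pooling the per-domain counts, which is a simplification for illustration and not necessarily how the published files are computed.

```python
# Made-up counts for illustration; not real WhoTracks.me numbers.
# Each row: (tracker, hostname, pages seen on, pages with a suspected fingerprint)
rows = [
    ("facebook", "facebook.net", 1000, 10),
    ("facebook", "facebook.com", 1000, 400),
]

# Domain-level view: one proportion per hostname.
by_domain = {host: fp / seen for _, host, seen, fp in rows}

# Tracker-level view: pool the counts of every hostname mapped to the tracker.
# Assumption: simple pooling; the real aggregation may differ.
totals = {}
for tracker, _, seen, fp in rows:
    seen_sum, fp_sum = totals.get(tracker, (0, 0))
    totals[tracker] = (seen_sum + seen, fp_sum + fp)
by_tracker = {tracker: fp / seen for tracker, (seen, fp) in totals.items()}

threshold = 0.1
print({host: share > threshold for host, share in by_domain.items()})
print({tracker: share > threshold for tracker, share in by_tracker.items()})
```

With these numbers the tracker as a whole passes a 0.1 threshold while the CDN domain does not, which is the kind of case where a list built from the tracker map ends up with more domains than one built from domains.csv.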

I hope that clears things up a little for you. For your use case, it looks like the domains.csv data view would fit better.

ecnmst · 2018-10-22T12:36:15Z

Hi @birdsarah, many thanks for the PR and issues raised. domains.csv is currently not exposed via the API. If you'd find this useful, you can add it to loader.py, extending our API. Here's one way to do it:

class Domains(PandasDataLoader):
    def __init__(self, data_months, region="global"):
        super().__init__(data_months, name="domains", region=region)

Then add this to the DataSource class, still in loader.py:

        ...
        self.domains = Domains(
            data_months=self.data_months,
            region=region
        )

Now you can consume domains via the DataSource:

data = DataSource(region="global")
domains = data.domains.df

Here, domains is a pandas DataFrame covering all months for which domains.csv is available.
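
From there, the fingerprinting list could be derived with a small filter. The helper below is hypothetical and not part of whotracksme; it assumes the frame has the month, host_tld and bad_qs columns used earlier in this thread, and it runs on a toy frame with made-up values in place of data.domains.df.

```python
import pandas as pd


def fingerprinting_domains(df, threshold=0.1, month=None):
    """Return the sorted unique host_tld values whose bad_qs exceeds threshold.

    Hypothetical helper, not part of whotracksme. If month is given, only rows
    for that month are considered; otherwise a domain qualifies if it exceeds
    the threshold in any month.
    """
    if month is not None:
        df = df[df.month == month]
    return sorted(df[df.bad_qs > threshold].host_tld.unique())


# Toy frame standing in for data.domains.df; all values are made up.
toy = pd.DataFrame({
    "month": ["2018-08", "2018-08", "2018-09", "2018-09"],
    "host_tld": ["a.com", "b.com", "a.com", "b.com"],
    "bad_qs": [0.30, 0.05, 0.02, 0.20],
})

print(fingerprinting_domains(toy))
print(fingerprinting_domains(toy, month="2018-09"))
```

Filtering across all months and filtering a single month can give different lists, so it is worth deciding which one you mean before comparing counts.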

birdsarah · 2018-10-22T18:48:32Z

Thanks so much for this feedback @sammacbeth @ecnmst. This is extremely helpful.
I'll leave this open and plan to make the addition to loader.py that @ecnmst proposes.

birdsarah · 2018-10-22T18:49:09Z

But if, on reflection, you don't want the update to loader.py feel free to close the issue.

birdsarah changed the title ~~Clarification on how to get a list domains associated with fingerprinting~~ Clarification on how to get a list of domains associated with fingerprinting Oct 20, 2018
