Handle dash for xff, and region id starting the path #712

Merged 1 commit on Oct 12, 2023
20 changes: 18 additions & 2 deletions etc/cdn-log-shipper/log-shipper.yml
@@ -92,6 +92,7 @@ Resources:
const crypto = require('crypto');

const IPV4_MASK = /\.[0-9]{1,3}$/;

const maskIp = (ip, field) => {
  if (ip.match(IPV4_MASK)) {
    return ip.replace(IPV4_MASK, '.0');
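
To illustrate the IPv4 branch visible in this hunk, here is a minimal sketch of what the IPV4_MASK replacement does. The helper name maskIpv4 is hypothetical and the addresses are made-up documentation-range values; the rest of maskIp is not shown in the diff and is not reproduced here.

```javascript
// Regex copied from the diff: matches the final ".NNN" octet of a dotted address.
const IPV4_MASK = /\.[0-9]{1,3}$/;

// Hypothetical helper covering only the IPv4 branch shown in the hunk.
// Anything that does not end in a dotted octet is returned unchanged.
const maskIpv4 = (ip) => (IPV4_MASK.test(ip) ? ip.replace(IPV4_MASK, '.0') : ip);

console.log(maskIpv4('203.0.113.77')); // '203.0.113.0'
console.log(maskIpv4('2001:db8::1')); // no trailing dotted octet, so unchanged
```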
@@ -119,6 +120,16 @@ Resources:
});
};

const findIp = (xff, ip) => {
  if (xff === '-') {
Review comment (Member, Author):

This is the bug that made all the hashed IPs the same. Most of the time the XFF comes through as '-', which is not blank but is also not an IP. That dash was being used as the IP instead of the client IP.

Review comment (Member):

Oof, good catch. I think I had something similar in the counts-lambda, but apparently forgot about it here.

    return ip;
  } else if (xff) {
    return xff.split(',').map(s => s.trim()).filter(s => s)[0];
  } else {
    return ip;
  }
};
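
The dash bug discussed in the review comments can be reproduced in isolation. The sketch below puts the expression this PR removes next to findIp as it appears in the diff. The wrapper name oldLeftMostIp and the sample addresses are hypothetical and exist only for this comparison.

```javascript
// Old behaviour (the expression removed by this PR), wrapped in a hypothetical helper.
// '-' is a non-empty string, so it survives the filter and wins over the client IP.
const oldLeftMostIp = (xff, ip) => {
  const xffParts = (xff || '').split(',').map(s => s.trim()).filter(s => s);
  return xffParts[0] || ip;
};

// New behaviour, copied from the findIp helper in this diff.
const findIp = (xff, ip) => {
  if (xff === '-') {
    return ip;
  } else if (xff) {
    return xff.split(',').map(s => s.trim()).filter(s => s)[0];
  } else {
    return ip;
  }
};

// Made-up documentation-range addresses, not real log data.
const clientIp = '203.0.113.7';

console.log(oldLeftMostIp('-', clientIp)); // '-'
console.log(findIp('-', clientIp)); // '203.0.113.7'
console.log(findIp('198.51.100.4, 10.0.0.1', clientIp)); // '198.51.100.4'
console.log(findIp(undefined, clientIp)); // '203.0.113.7'
```

With the old expression, every row whose XFF was '-' produced the same "IP", which would explain the identical hashes described above.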

const PODCAST_IDS = process.env.PODCAST_IDS.split(',').map(s => s.trim()).filter(s => s);

const IGNORE_PATHS = ['/', '/favicon.ico', '/robots.txt'];
@@ -146,6 +157,12 @@ Resources:
// podcast id and episode guid (only works for dovetail3-cdn requests)
Review comment (Member):

Not directly related to this change, but for re-processing purposes it may be useful to have this lambda also log which S3 input file it is processing and how many rows it contained. Somewhere just above this line:

console.info(`Read ${rows.length} rows from s3://${Bucket}/${Key}`);

const datas = mappedRows.filter(data => {
  const parts = data['cs-uri-stem'].split('/').filter(s => s);

  // if the path starts with a region like usw2, shift that off
  if (parts[0] && parts[0].match(/^[a-z][a-z0-9\-]+$/)) {
Review comment (Member, Author):

This was the other bug: some requests have an AWS region name prefix in the path, like /usw2/. These were all getting filtered out.

    parts.shift();
  }

  if (parts.length === 4) {
    data['prx-podcast-id'] = parts[0];
    data['prx-episode-guid'] = parts[1];
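
As a rough illustration of the region-prefix handling in this hunk, the sketch below wraps the split-and-shift steps in a hypothetical helper named pathParts. The sample paths are made up, and the example assumes podcast ids do not start with a lowercase letter (an id that did would also match the regex and be shifted off).

```javascript
// Regex copied from the diff: a lowercase letter followed by one or more
// lowercase letters, digits or dashes (e.g. "usw2").
const REGION_PREFIX = /^[a-z][a-z0-9\-]+$/;

// Hypothetical helper wrapping the steps shown in the diff.
const pathParts = (uriStem) => {
  const parts = uriStem.split('/').filter(s => s);
  // if the path starts with a region like usw2, shift that off
  if (parts[0] && parts[0].match(REGION_PREFIX)) {
    parts.shift();
  }
  return parts;
};

// Made-up paths; only the first two segments (podcast id, episode guid)
// are named in the diff, the last two are placeholders.
console.log(pathParts('/usw2/1234/some-guid/abcd/file.mp3').length); // 4
console.log(pathParts('/1234/some-guid/abcd/file.mp3').length); // 4
```

Without the shift, the prefixed path splits into five parts, so the parts.length === 4 check never assigns the podcast id and episode guid for that row.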
@@ -163,8 +180,7 @@ Resources:
// calculate listener_ids
datas.forEach(data => {
// use leftmost XFF or IP
- const xffParts = (data['x-forwarded-for'] || '').split(',').map(s => s.trim()).filter(s => s);
- const leftMostIp = xffParts[0] || data['c-ip'];

Review comment (Member, Author):

This comparison was picking the dash value ('-') of the XFF over the client IP.

+ const leftMostIp = findIp(data['x-forwarded-for'], data['c-ip']);

// truncate ipv6 but not ipv4
const truncatedIp = leftMostIp.includes(':') ? maskIp(leftMostIp, 'listener-id') : leftMostIp;