Identify a sample set of government sites to inventory #4629

Open
Tracked by #4489
drewbo opened this issue Oct 21, 2024 · 2 comments

Contributor

drewbo commented Oct 21, 2024

No description provided.

Contributor

sknep commented Nov 20, 2024

Here's a cleaned dataset of about 5,000 sites. Cleaning steps:

  • Removing all columns starting with source_ (to make the set smaller, and because it's not important where the sites came from)
  • Removing all .mil sites
  • Removing sites where public or final_url_live is not true
  • Removing any rows that resolve to GitHub or to non-government-owned .com websites
  • Removing duplicated sites:
    • for rows with the same final_url_website, keep only the record with target_url_redirects = false
    • if all of the rows are redirects, keep the one with the lowest final_url_status_code
  • Removing sites that don't have the text/html media type in final_url_media_type
  • Removing sites with ftp or admin in the domain
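
For anyone who wants to reproduce the filtering, the steps above can be sketched roughly as follows. This is a minimal, standard-library-only sketch and not the script that produced the attached file. It assumes each row is a dict keyed by the column names mentioned above, that booleans are stored as the strings "true"/"false", that final_url_website holds a bare hostname, and that both public and final_url_live must be true for a site to be kept. The function names and the `gov_owned_com` allowlist are hypothetical.

```python
def is_true(value):
    # Assumption: booleans arrive as "true"/"false" strings (any casing).
    return str(value).strip().lower() == "true"


def host_of(row):
    # Assumption: final_url_website holds a bare hostname such as "example.gov".
    return row["final_url_website"].strip().lower()


def clean_sites(rows, gov_owned_com=frozenset()):
    """Apply the listed cleaning steps, in the listed order, to dict rows.

    gov_owned_com is a hypothetical allowlist of government-owned .com hosts.
    """
    stage1 = []
    for row in rows:
        # Drop provenance columns (names starting with "source_").
        row = {k: v for k, v in row.items() if not k.startswith("source_")}
        host = host_of(row)
        if host.endswith(".mil"):
            continue
        if not (is_true(row["public"]) and is_true(row["final_url_live"])):
            continue
        if "github" in host:
            continue
        if host.endswith(".com") and host not in gov_owned_com:
            continue
        stage1.append(row)

    # Group rows that share the same final_url_website.
    groups = {}
    for row in stage1:
        groups.setdefault(host_of(row), []).append(row)

    deduped = []
    for same_site in groups.values():
        direct = [r for r in same_site if not is_true(r["target_url_redirects"])]
        if direct:
            # Keep one non-redirect record (the first, if there are several).
            deduped.append(direct[0])
        else:
            # All rows are redirects: keep the lowest status code.
            deduped.append(min(same_site, key=lambda r: int(r["final_url_status_code"])))

    # Final filters: HTML responses only, no ftp/admin hosts.
    return [
        r for r in deduped
        if "text/html" in r["final_url_media_type"]
        and "ftp" not in host_of(r)
        and "admin" not in host_of(r)
    ]
```

Rows in this shape could come from `csv.DictReader` over the source CSV. Because the media-type filter runs after deduplication here, as in the list, a site whose kept record is not text/html is dropped entirely.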

cleaned_federal_site_data.csv

working notes file:

https://docs.google.com/document/d/1arE0mDjwP6NPY_uOLP5DMwzlUMdaOKtvz87S5u2qmJ4/edit?tab=t.0

drewbo self-assigned this Dec 2, 2024