Washington, DC events scraper #75

Open
dphoria opened this issue Feb 15, 2022 · 14 comments
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested

Comments

@dphoria
Collaborator

dphoria commented Feb 15, 2022

Feature Description

A clear and concise description of the feature you're requesting.

Provide a file in cdp_scrapers/instances/, such as cdp_scrapers/instances/dc.py, containing a function that returns Washington, DC city council meetings for a given period of time as List[EventIngestionModel], e.g.

get_events(begin: Optional[datetime] = None, end: Optional[datetime] = None) -> List[EventIngestionModel]

Use Case

Please provide a use case to help us understand your request in context.

The above file and API would be used in deploying a CDP instance for Washington, DC.

@dphoria dphoria added enhancement New feature or request help wanted Extra attention is needed question Further information is requested labels Feb 15, 2022
@dphoria
Collaborator Author

dphoria commented Feb 15, 2022

I have hit a roadblock. I can't quite come up with a clean way to get events for a given period of time. So far, the main candidates for information sources are

  1. https://dccouncil.us/events/list/?tribe_event_display=past&tribe_paged=1
  2. http://dc.granicus.com/viewpublisher.php?view_id=2
  3. https://lims.dccouncil.us/api/help/index.html

I have yet to figure out how to query any of the above sources with a time period as parameter(s). 1 and 2 can be used to retrieve the most recent N meetings.
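For reference, a minimal sketch of paging through source 1 to collect the most recent meetings. It assumes that incrementing the tribe_paged query parameter walks further back through past events, which I have not verified; past_events_page_url is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL of source 1 exactly as listed above.
PAST_EVENTS_URL = (
    "https://dccouncil.us/events/list/?tribe_event_display=past&tribe_paged=1"
)


def past_events_page_url(page: int) -> str:
    """Return the past-events list URL for a page number.

    Assumption: page 1 holds the most recent events and larger
    values of tribe_paged go further back in time.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    parts = urlsplit(PAST_EVENTS_URL)
    # Keep the other query parameters and replace only the page number.
    query = dict(parse_qsl(parts.query))
    query["tribe_paged"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Fetching and parsing each page is left out here; filtering by date would still have to happen on our side after scraping.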

3, an API, seemed like a good choice going in, but I am not so warm to it now. It is a good resource for information about specific bills, laws, etc. To me, it is almost useless for getting meeting information such as agendas.

@evamaxfield
Member

What about this? https://dccouncil.us/events/2022-01/

You can fill in any year and month and then find the day in the calendar?
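A rough sketch of how a begin/end range could map onto those monthly calendar pages. The URL pattern is taken from the link above; month_calendar_urls is a hypothetical name, and finding the individual days inside each page is not shown.

```python
from datetime import datetime
from typing import Iterator


def month_calendar_urls(begin: datetime, end: datetime) -> Iterator[str]:
    """Yield one calendar URL per month touched by [begin, end]."""
    year, month = begin.year, begin.month
    # Tuple comparison orders (year, month) pairs chronologically.
    while (year, month) <= (end.year, end.month):
        yield f"https://dccouncil.us/events/{year:04d}-{month:02d}/"
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

For example, a range from mid-December 2021 to early February 2022 would produce the 2021-12, 2022-01 and 2022-02 pages.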

@dphoria
Collaborator Author

dphoria commented Feb 16, 2022

> What about this? https://dccouncil.us/events/2022-01/
>
> You can fill in any year and month and then find the day in the calendar?

Oh man why didn't I think of this approach! I even saw that calendar before LOL. Yes I think this is, at least to me, the best route I've seen thus far. 👍 Awesome.

@dphoria
Collaborator Author

dphoria commented Feb 21, 2022

Finally got around to making a first draft. It only gets the minimum for now. https://gist.github.com/dphoria/7bea514b1a201f33ade2cf8c8d9fa707
Made it a stand-alone file for now for easier development and testing.

>>> import washington_dc
>>> from datetime import datetime
>>> washington_dc.get_events_on_date(datetime(2022, 2, 1))
[
    EventIngestionModel(
        body=Body(name='Committee of the Whole', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 12, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_2bc5049c-4415-4cbe-a069-35623328a371.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=4&clip_id=7039',
                external_source_id=None,
            ),
        ],
        event_minutes_items=None,
        agenda_uri='https://dccouncil.us/wp-content/uploads/2022/01/2.1.22-COW-Agenda_ADDITIONAL-1.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
    EventIngestionModel(
        body=Body(name='City Council', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 13, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_dc26ab8b-ac05-48cd-968e-94ba67282a87.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=3&clip_id=7040',
                external_source_id=None,
            ),
        ],
        event_minutes_items=None,
        agenda_uri='https://dccouncil.us/wp-content/uploads/2021/12/February-1-2022-Legislative-Meeting-2.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
]
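A possible way to build the requested get_events(begin, end) on top of a per-day function like get_events_on_date, sketched under assumptions: the per-day function is passed in as a parameter so the snippet stands alone, events are treated as opaque objects, and the defaults for missing begin/end (now, and two days before end) are my guess rather than a confirmed convention.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, List, Optional


def get_events(
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
    get_events_on_date: Callable[[datetime], List[Any]] = lambda day: [],
) -> List[Any]:
    """Collect events for every calendar day from begin through end."""
    # Assumed defaults: end is "now", begin is two days before end.
    if end is None:
        end = datetime.now()
    if begin is None:
        begin = end - timedelta(days=2)

    events: List[Any] = []
    # Start at midnight so the first day is always visited.
    day = datetime(begin.year, begin.month, begin.day)
    while day <= end:
        events.extend(get_events_on_date(day))
        day += timedelta(days=1)
    return events
```

Note that this works at whole-day granularity, so a session earlier in the day than begin would still be returned; filtering on session_datetime would be a follow-up step.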

@dphoria
Collaborator Author

dphoria commented Feb 21, 2022

The foremost question in my head is the best way to get votes. I think the answer is
https://lims.dccouncil.us/
https://lims.dccouncil.us/api/help/index.html
using information parsed from an event page like
https://dccouncil.us/event/legislative-meeting-86/

@dphoria
Collaborator Author

dphoria commented Feb 21, 2022

What is highly disappointing is that I thought DC used to list an event's minutes items in the lower-left table of their video player. That no longer seems to be the case?

e.g. On http://dc.granicus.com/ViewPublisher.php?view_id=3, click on any "Video" link on the right. The popup is largely empty with just the video. That used to have a lot of useful information we could have used to get EventMinutesItem, etc.

@AZIZXlaouiti

@dphoria I noticed that too, along with the absence of a PDF document, and sometimes captions aren't available.

@evamaxfield
Member

Nice job!!

Can't comment on the PDF document, but I wouldn't worry if captions are only sometimes available. Seattle has captions for roughly 95% of meetings. If captions aren't available we will fall back to Google. No worries.

Excited to see this progress!!

@dphoria
Collaborator Author

dphoria commented Mar 17, 2022

Any luck adding to the scraper, @AZIZXlaouiti? I've been working on other issues recently and probably will be for another couple of weeks. After that I may be able to hop back on this if necessary. Anyway, just wanted to check in.

@AZIZXlaouiti

@dphoria I had some busy weeks (family and interview related), so I wasn't as active as I wanted to be, but I will resume the work this week. My apologies.

@AZIZXlaouiti

@dphoria I managed to get the event_minutes added. I parsed the PDF from agenda_uri and extracted all the legislation numbers. Next I'll have to use the LIMS API to get the votes, vote status, and persons.
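A minimal sketch of the legislation number step, not the code from the gist. It assumes the agenda PDF text has already been extracted into a string, and it only knows the two prefixes that show up in this thread ("Bill" and "CER"); the normalized form (for example "Bill 24-117" becoming "B24-0117") is inferred from the sample output, and find_legislation_numbers is a hypothetical name.

```python
import re
from typing import List

# Assumption: only the prefixes seen in this thread are mapped.
_PREFIXES = {"Bill": "B", "CER": "CER"}
_PATTERN = re.compile(r"\b(Bill|CER)\s+(\d+)-(\d+)\b")


def find_legislation_numbers(agenda_text: str) -> List[str]:
    """Return normalized legislation numbers in order of first appearance."""
    numbers: List[str] = []
    for prefix, period, number in _PATTERN.findall(agenda_text):
        # Zero-pad the item number to four digits, e.g. 117 -> 0117.
        normalized = f"{_PREFIXES[prefix]}{int(period)}-{int(number):04d}"
        if normalized not in numbers:
            numbers.append(normalized)
    return numbers
```

The resulting identifiers could then be used as lookup keys for the votes, but the LIMS request itself is deliberately not shown since I have not confirmed the endpoint shapes.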

@AZIZXlaouiti

AZIZXlaouiti commented Mar 19, 2022

https://gist.github.com/AZIZXlaouiti/b3b0ccab24a1fbd0586fb8756fc85c1c

[
    EventIngestionModel(
        body=Body(name='Committee of the Whole', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 12, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_2bc5049c-4415-4cbe-a069-35623328a371.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=4&clip_id=7039',
                external_source_id=None,
            ),
        ],
        event_minutes_items=[
            EventMinutesItem(
                minutes_item=MinutesItem(name='Bill 24-117', description=None, external_source_id=None),
                index=None,
                matter=Matter(name='B24-0117', matter_type=None, title='Armstead Barnett Way Designation Act of 2021', result_status=None, sponsors=None, external_source_id=None),
                supporting_files=None,
                decision=None,
                votes=None,
            ),
        ],
        agenda_uri='https://dccouncil.us/wp-content/uploads/2022/01/2.1.22-COW-Agenda_ADDITIONAL-1.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
    EventIngestionModel(
        body=Body(name='City Council', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 13, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_dc26ab8b-ac05-48cd-968e-94ba67282a87.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=3&clip_id=7040',
                external_source_id=None,
            ),
        ],
        event_minutes_items=[
            EventMinutesItem(
                minutes_item=MinutesItem(name='CER 24-125', description=None, external_source_id=None),
                index=None,
                matter=Matter(name='CER24-0125', matter_type=None, title='Beverly Odoms-Johnson Posthumous Recognition Ceremonial Resolution of 2022', result_status=None, sponsors=None, external_source_id=None),
                supporting_files=None,
                decision=None,
                votes=None,
            ),
        ],
        agenda_uri='https://dccouncil.us/wp-content/uploads/2021/12/February-1-2022-Legislative-Meeting-2.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
]

@dphoria
Collaborator Author

dphoria commented Mar 21, 2022

> @dphoria i had some busy weeks (family / interview) related so i wasn't active as i wanted to be but i will resume the work this week . My apologies.

No, absolutely no need for any apologies. 😄 I was just curious.
