6/craigslist scraper #56

Merged · 50 commits · Dec 1, 2023
Commits
86eecae
Fixed csv output
waseem-polus Oct 18, 2023
cb9bbed
removed debug print statements
waseem-polus Oct 18, 2023
8255e4b
moved scrapers into src dir
waseem-polus Oct 26, 2023
880cc97
renamed craigslist to craigslist-api
waseem-polus Oct 26, 2023
58cf7ad
craigslist scraper collects image data
waseem-polus Oct 27, 2023
59455e4
nested craigslist homepage and listing scrapers in their own folder
waseem-polus Oct 27, 2023
403e579
set selenium in headless mode
waseem-polus Oct 27, 2023
16adc90
can scrape description and attributes of craigslist listing
waseem-polus Oct 27, 2023
e08ac19
can scrape description and attributes of craigslist listing
waseem-polus Oct 27, 2023
ef58802
replaced spaces with tabs
waseem-polus Oct 27, 2023
be1975a
updated miles label to odometer
waseem-polus Oct 27, 2023
ea32b66
added .pyc files to .gitignore
waseem-polus Oct 30, 2023
b3fe6af
removed un-used import
waseem-polus Oct 30, 2023
6756cd1
pulled changes from main
waseem-polus Oct 31, 2023
66ac3dd
delete duplicate craigslist file
waseem-polus Oct 31, 2023
8d2acb4
moved scrapers dir to root
waseem-polus Oct 31, 2023
b4cc2f4
Grouped homepage and listing files into one craigslist file
waseem-polus Oct 31, 2023
b2a1087
Removed craigslist-api file
waseem-polus Oct 31, 2023
f1c7372
extracted utils from craigslist scraper
waseem-polus Oct 31, 2023
5e694dd
removed un-used function
waseem-polus Oct 31, 2023
912006b
Updated facebook scraper
waseem-polus Oct 31, 2023
f1298de
Extracted scraper logic into utils
waseem-polus Oct 31, 2023
f6c0a03
removed un-used imports
waseem-polus Oct 31, 2023
70c01d4
Track scraper versions in db
waseem-polus Oct 31, 2023
2182688
use link as _id for db
waseem-polus Oct 31, 2023
146948a
Fixed import issues
waseem-polus Oct 31, 2023
7f893e0
craigslist scraper functionality complete
waseem-polus Oct 31, 2023
b40d0c7
extracted click function into utils
waseem-polus Oct 31, 2023
8d9b30a
Facebook listing scraper incomplete
waseem-polus Oct 31, 2023
e91838c
added pipfile and pipfile.lock to manage dependencies
waseem-polus Nov 9, 2023
0e1cd28
organize file structure
waseem-polus Nov 9, 2023
c2694a4
Update README with scraper instructions
waseem-polus Nov 9, 2023
816d185
pipenv shell in README
waseem-polus Nov 9, 2023
667183d
added pipfile scripts
waseem-polus Nov 9, 2023
f14de60
update README with pipenv scripts
waseem-polus Nov 9, 2023
dcd0959
updated db and collection name
waseem-polus Nov 10, 2023
3e202b0
finally got docker running locally T-T
waseem-polus Nov 13, 2023
21e751c
use environ.get instead of getenv
waseem-polus Nov 13, 2023
5a5515f
updated pipenv scripts to work for docker locally
waseem-polus Nov 17, 2023
5e86f1c
update README with new pipenv commands
waseem-polus Nov 17, 2023
81f4f40
Add missing new lines
waseem-polus Nov 17, 2023
fd2cf6f
use regex to determine if facebook or craigslist link
waseem-polus Nov 30, 2023
378caeb
Pull changes from main branch
waseem-polus Nov 30, 2023
cd2af8c
added new lines to fix linting erros
waseem-polus Nov 30, 2023
6ea8e12
fixed import order and spacing for isort
waseem-polus Nov 30, 2023
a8ab8b7
fixed flake8 errors with black linter
waseem-polus Nov 30, 2023
eda33a9
added flake8, black, and isort to dev dependencies
waseem-polus Nov 30, 2023
dc860ec
fixed hadolint errors in dockerfile
waseem-polus Dec 1, 2023
d5a6170
Added latest versions of yum packages to dockerfile
waseem-polus Dec 1, 2023
3a9e8c8
isort plz T-T
waseem-polus Dec 1, 2023
5 changes: 4 additions & 1 deletion .gitignore
@@ -130,4 +130,7 @@ dist
.pnp.*

# misc
*.DS_STORE
*.DS_STORE

# python
*.pyc
35 changes: 34 additions & 1 deletion README.md
@@ -5,4 +5,37 @@ Senior Design Repository for the Statefarm Automotive Fraud Project
Make a copy of the ``.env.example`` file and make the following changes.
1. remove ``.example`` from the extension
2. Paste the username and password provided in MongoDB Atlas (if you should have access but do not, please contact @waseem-polus)
3. Paste the connection URL provided in MongoDB Atlas. Include the password and username fields using ``${VARIABLE}`` syntax to embed the value of each variable
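For example, a filled-in ``.env`` might look like the following (the variable names and cluster host here are placeholders, not the project's real values):

```bash
# .env -- placeholder values only; use the credentials provided in MongoDB Atlas
MONGO_USERNAME="your-atlas-username"
MONGO_PASSWORD="your-atlas-password"
MONGO_URL="mongodb+srv://${MONGO_USERNAME}:${MONGO_PASSWORD}@cluster0.example.mongodb.net/"
```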

## Run Scrapers locally
**Prerequisites**
- python3
- pipenv

**Installing dependencies**
Navigate to ``scrapers/`` and activate the virtual environment using
```bash
pipenv shell
```
Then install dependencies using
```bash
pipenv install
```

**Scraper Usage**
To build a Docker image, use
```bash
pipenv run build
```
To run a Docker container named "smarecontainer", use
```bash
pipenv run cont
```
Then run either scraper:
```bash
# Scrape Craigslist homepage
pipenv run craigslist

# Scrape Facebook Marketplace homepage
pipenv run facebook
```
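One commit in this PR mentions using a regex to determine whether a link belongs to Facebook or Craigslist. A minimal sketch of how such routing could work (the patterns and function name here are assumptions for illustration, not the project's actual code):

```python
import re

# Hypothetical link classifier: route a scraped URL to the right scraper.
# The exact patterns are assumptions; real Craigslist links use a
# region subdomain (e.g. chicago.craigslist.org).
CRAIGSLIST_RE = re.compile(r"https?://([\w-]+\.)?craigslist\.org/")
FACEBOOK_RE = re.compile(r"https?://(www\.)?facebook\.com/marketplace/")


def classify_link(url):
    """Return 'craigslist', 'facebook', or None for an unrecognized link."""
    if CRAIGSLIST_RE.match(url):
        return "craigslist"
    if FACEBOOK_RE.match(url):
        return "facebook"
    return None
```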
File renamed without changes.
22 changes: 22 additions & 0 deletions scrapers/Dockerfile
@@ -0,0 +1,22 @@
FROM public.ecr.aws/lambda/python@sha256:f0c3116a56d167eba8021a5d7c595f969835fbe78826303326f80de00d044733 as build
RUN yum install -y unzip && \
curl -Lo "/tmp/chromedriver-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chromedriver-linux64.zip" && \
curl -Lo "/tmp/chrome-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chrome-linux64.zip" && \
unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
unzip /tmp/chrome-linux64.zip -d /opt/

FROM public.ecr.aws/lambda/python@sha256:f0c3116a56d167eba8021a5d7c595f969835fbe78826303326f80de00d044733
RUN yum install atk cups-libs gtk3 libXcomposite alsa-lib \
libXcursor libXdamage libXext libXi libXrandr libXScrnSaver \
libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb \
xorg-x11-xauth dbus-glib dbus-glib-devel -y
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/

COPY scrapers.py ./
COPY src ./src
COPY requirements.txt ./

RUN pip install -r requirements.txt

CMD [ "scrapers.craigslist" ]
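The AWS Lambda Python base image invokes the handler named in ``CMD``, so ``scrapers.py`` must expose a ``craigslist`` function taking the usual ``(event, context)`` pair. A minimal sketch of that handler shape (the body is a placeholder assumption; the real function runs the Selenium scraper against the Chrome binary unpacked to ``/opt/chrome``):

```python
# Sketch of the handler shape that CMD ["scrapers.craigslist"] expects.
# The body is a stand-in: the real implementation launches headless Chrome,
# scrapes listings, and writes them to MongoDB.
def craigslist(event, context):
    scraped = 0  # placeholder for the number of listings collected
    return {"statusCode": 200, "scraped": scraped}
```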
23 changes: 23 additions & 0 deletions scrapers/Pipfile
@@ -0,0 +1,23 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[scripts]
build = "docker build --platform linux/amd64 -t smare ."
cont = "docker run --name smarecontainer -d smare:latest"
exec = "docker exec -it smarecontainer"
craigslist = "pipenv run exec python3 -c 'import scrapers; scrapers.craigslist(\"\",\"\")'"
facebook = "pipenv run exec python3 -c 'import scrapers; scrapers.facebook(\"\",\"\")'"

[packages]
selenium = "*"
bs4 = "*"
pymongo = "*"
typer = "*"
python-dotenv = "*"

[dev-packages]

[requires]
python_version = "3.11"
281 changes: 281 additions & 0 deletions scrapers/Pipfile.lock