6/craigslist scraper #56

Merged · 50 commits · Dec 1, 2023

Commits
86eecae
Fixed csv output
waseem-polus Oct 18, 2023
cb9bbed
removed debug print statements
waseem-polus Oct 18, 2023
8255e4b
moved scrapers into src dir
waseem-polus Oct 26, 2023
880cc97
renamed craigslist to craigslist-api
waseem-polus Oct 26, 2023
58cf7ad
craigslist scraper collects image data
waseem-polus Oct 27, 2023
59455e4
nested craigslist homepage and listing scrapers in their own folder
waseem-polus Oct 27, 2023
403e579
set selenium in headless mode
waseem-polus Oct 27, 2023
16adc90
can scrape description and attributes of craigslist listing
waseem-polus Oct 27, 2023
e08ac19
can scrape description and attributes of craigslist listing
waseem-polus Oct 27, 2023
ef58802
replaced spaces with tabs
waseem-polus Oct 27, 2023
be1975a
updated miles label to odometer
waseem-polus Oct 27, 2023
ea32b66
added .pyc files to .gitignore
waseem-polus Oct 30, 2023
b3fe6af
removed un-used import
waseem-polus Oct 30, 2023
6756cd1
pulled changes from main
waseem-polus Oct 31, 2023
66ac3dd
delete duplicate craigslist file
waseem-polus Oct 31, 2023
8d2acb4
moved scrapers dir to root
waseem-polus Oct 31, 2023
b4cc2f4
Grouped homepage and listing files into one craigslist file
waseem-polus Oct 31, 2023
b2a1087
Removed craigslist-api file
waseem-polus Oct 31, 2023
f1c7372
extracted utils from craigslist scraper
waseem-polus Oct 31, 2023
5e694dd
removed un-used function
waseem-polus Oct 31, 2023
912006b
Updated facebook scraper
waseem-polus Oct 31, 2023
f1298de
Extracted scraper logic into utils
waseem-polus Oct 31, 2023
f6c0a03
removed un-used imports
waseem-polus Oct 31, 2023
70c01d4
Track scraper versions in db
waseem-polus Oct 31, 2023
2182688
use link as _id for db
waseem-polus Oct 31, 2023
146948a
Fixed import issues
waseem-polus Oct 31, 2023
7f893e0
craigslist scraper functionality complete
waseem-polus Oct 31, 2023
b40d0c7
extracted click function into utils
waseem-polus Oct 31, 2023
8d9b30a
Facebook listing scraper incomplete
waseem-polus Oct 31, 2023
e91838c
added pipfile and pipfile.lock to manage dependencies
waseem-polus Nov 9, 2023
0e1cd28
organize file structure
waseem-polus Nov 9, 2023
c2694a4
Update README with scraper instructions
waseem-polus Nov 9, 2023
816d185
pipenv shell in README
waseem-polus Nov 9, 2023
667183d
added pipfile scripts
waseem-polus Nov 9, 2023
f14de60
update README with pipenv scripts
waseem-polus Nov 9, 2023
dcd0959
updated db and collection name
waseem-polus Nov 10, 2023
3e202b0
finally got docker running locally T-T
waseem-polus Nov 13, 2023
21e751c
use environ.get instead of getenv
waseem-polus Nov 13, 2023
5a5515f
updated pipenv scripts to work for docker locally
waseem-polus Nov 17, 2023
5e86f1c
update README with new pipenv commands
waseem-polus Nov 17, 2023
81f4f40
Add missing new lines
waseem-polus Nov 17, 2023
fd2cf6f
use regex to determine if facebook or craigslist link
waseem-polus Nov 30, 2023
378caeb
Pull changes from main branch
waseem-polus Nov 30, 2023
cd2af8c
added new lines to fix linting errors
waseem-polus Nov 30, 2023
6ea8e12
fixed import order and spacing for isort
waseem-polus Nov 30, 2023
a8ab8b7
fixed flake8 errors with black linter
waseem-polus Nov 30, 2023
eda33a9
added flake8, black, and isort to dev dependencies
waseem-polus Nov 30, 2023
dc860ec
fixed hadolint errors in dockerfile
waseem-polus Dec 1, 2023
d5a6170
Added latest versions of yum packages to dockerfile
waseem-polus Dec 1, 2023
3a9e8c8
isort plz T-T
waseem-polus Dec 1, 2023
5 changes: 4 additions & 1 deletion .gitignore
@@ -130,4 +130,7 @@ dist
.pnp.*

# misc
*.DS_STORE
*.DS_STORE

# python
*.pyc
33 changes: 33 additions & 0 deletions README.md
@@ -16,3 +16,36 @@ Make a copy of the ``.env.example`` file and make the following changes.
2. Paste the username and password provided in MongoDB Atlas (if you should have access but do not, please contact @waseem-polus)

3. Paste the connection URL provided in MongoDB Atlas. Include the username and password fields using ``${VARIABLE}`` syntax to embed the value of each variable
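
A minimal sketch of what the resulting ``.env`` could look like (the variable names below are illustrative assumptions; use the ones defined in ``.env.example``):
```bash
# Hypothetical variable names; match them to the keys in .env.example
MONGODB_USERNAME=your-atlas-username
MONGODB_PASSWORD=your-atlas-password
MONGODB_URI=mongodb+srv://${MONGODB_USERNAME}:${MONGODB_PASSWORD}@cluster0.example.mongodb.net/
```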

## Run Scrapers locally
**Prerequisites**
- python3
- pipenv

**Installing dependencies**
Navigate to ``scrapers/`` and activate the virtual environment using
```bash
pipenv shell
```
Then install dependencies using
```bash
pipenv install
```

**Scraper Usage**
To build a Docker image, use
```bash
pipenv run build
```
To run a Docker container named "smarecontainer", use
```bash
pipenv run cont
```
then run a scraper
```bash
# Scrape Craigslist homepage
pipenv run craigslist

# Scrape Facebook Marketplace homepage
pipenv run facebook
```
File renamed without changes.
2 changes: 2 additions & 0 deletions scrapers/.flake8
@@ -0,0 +1,2 @@
[flake8]
max-line-length = 120
25 changes: 25 additions & 0 deletions scrapers/Dockerfile
@@ -0,0 +1,25 @@
FROM public.ecr.aws/lambda/python@sha256:f0c3116a56d167eba8021a5d7c595f969835fbe78826303326f80de00d044733 as build
RUN yum install -y unzip-* && \
curl -Lo "/tmp/chromedriver-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chromedriver-linux64.zip" && \
curl -Lo "/tmp/chrome-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chrome-linux64.zip" && \
unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
unzip /tmp/chrome-linux64.zip -d /opt/ && \
yum clean all

FROM public.ecr.aws/lambda/python@sha256:f0c3116a56d167eba8021a5d7c595f969835fbe78826303326f80de00d044733
RUN yum install atk-* cups-libs-* gtk3-* libXcomposite-* alsa-lib-* \
libXcursor-* libXdamage-* libXext-* libXi-* libXrandr-* libXScrnSaver-* \
libXtst-* pango-* at-spi2-atk-* libXt-* xorg-x11-server-Xvfb-* \
xorg-x11-xauth-* dbus-glib-* dbus-glib-devel-* -y && \
yum clean all
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/

WORKDIR /var/task
COPY scrapers.py ./
COPY src ./src
COPY requirements.txt ./

RUN pip install --no-cache-dir -r requirements.txt

CMD [ "scrapers.craigslist" ]
26 changes: 26 additions & 0 deletions scrapers/Pipfile
@@ -0,0 +1,26 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[scripts]
build = "docker build --platform linux/amd64 -t smare ."
cont = "docker run --name smarecontainer -d smare:latest"
exec = "docker exec -it smarecontainer"
craigslist = "pipenv run exec python3 -c 'import scrapers; scrapers.craigslist(\"\",\"\")'"
facebook = "pipenv run exec python3 -c 'import scrapers; scrapers.facebook(\"\",\"\")'"

[packages]
selenium = "*"
bs4 = "*"
pymongo = "*"
typer = "*"
python-dotenv = "*"

[dev-packages]
isort = "*"
black = "*"
flake8 = "*"

[requires]
python_version = "3.11"
389 changes: 389 additions & 0 deletions scrapers/Pipfile.lock

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions scrapers/requirements.txt
@@ -0,0 +1,23 @@
-i https://pypi.org/simple
attrs==23.1.0; python_version >= '3.7'
beautifulsoup4==4.12.2; python_full_version >= '3.6.0'
bs4==0.0.1
certifi==2023.11.17; python_version >= '3.6'
click==8.1.7; python_version >= '3.7'
dnspython==2.4.2; python_version >= '3.8' and python_version < '4.0'
h11==0.14.0; python_version >= '3.7'
idna==3.6; python_version >= '3.5'
outcome==1.3.0.post0; python_version >= '3.7'
pymongo==4.6.1; python_version >= '3.7'
pysocks==1.7.1
python-dotenv==1.0.0; python_version >= '3.8'
selenium==4.15.2; python_version >= '3.8'
sniffio==1.3.0; python_version >= '3.7'
sortedcontainers==2.4.0
soupsieve==2.5; python_version >= '3.8'
trio==0.23.1; python_version >= '3.8'
trio-websocket==0.11.1; python_version >= '3.7'
typer==0.9.0; python_version >= '3.6'
typing-extensions==4.8.0; python_version >= '3.8'
urllib3[socks]==2.1.0; python_version >= '3.8'
wsproto==1.2.0; python_full_version >= '3.7.0'
45 changes: 45 additions & 0 deletions scrapers/scrapers.py
@@ -0,0 +1,45 @@
import re

import typer
from src import craigslist as cl
from src import database as db
from src import facebook as fb
from src import utils

app = typer.Typer()

craigslistScraperVersion = 1
facebookScraperVersion = 1


@app.command()
def craigslist(event, context):
utils.scrape("craigslist", craigslistScraperVersion)


@app.command()
def facebook(event, context):
utils.scrape("facebook", facebookScraperVersion)


@app.command()
def link(link: str):
clPattern = re.compile(
r"^https://[a-zA-Z-]+\.craigslist\.org(?:/[^\s?]*)?(?:\?[^\s]*)?$"
)
fbPattern = re.compile(
r"^https://www\.facebook\.com/marketplace(?:/[^\s?]*)?(?:\?[^\s]*)?$"
)

if clPattern.match(link):
newInfo = cl.scrapeListing(link)
db.update(link, newInfo)
elif fbPattern.match(link):
newInfo = fb.scrapeListing(link)
print(newInfo)
else:
print("Not a Craigslist nor a Facebook Marketplace link")


if __name__ == "__main__":
app()
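
The ``link`` command above can also be pointed at an individual listing. A minimal usage sketch, assuming the ``exec`` script from the Pipfile and a running ``smarecontainer`` (the listing URL is a placeholder):

```bash
# Re-scrape a single Craigslist listing and update its database record
pipenv run exec python3 scrapers.py link "https://austin.craigslist.org/cto/d/example-listing/0000000000.html"
```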
Empty file added scrapers/src/__init__.py
Empty file.
148 changes: 148 additions & 0 deletions scrapers/src/craigslist.py
@@ -0,0 +1,148 @@
import time

from bs4 import BeautifulSoup

from . import utils


def loadPageResources(driver):
scroll = 100

print("Waiting to load...")
time.sleep(2)

utils.scrollTo(scroll, driver)

loadImgButtons = driver.find_elements("class name", "slider-back-arrow")

time.sleep(2)

# Emulate a user scrolling
for i in range(len(loadImgButtons)):
scroll += 100
utils.scrollTo(scroll, driver)

utils.clickOn(loadImgButtons[i], driver)

time.sleep(0.5)


def setupURLs(oldestAllowedCars):
# List of TX cities to scrape; can be expanded
cities = [
"abilene",
"amarillo",
"austin",
"beaumont",
"brownsville",
"collegestation",
"corpuschristi",
"dallas",
"nacogdoches",
"delrio",
"elpaso",
"galveston",
"houston",
"killeen",
"laredo",
"lubbock",
"mcallen",
"odessa",
"sanangelo",
"sanantonio",
"sanmarcos",
"bigbend",
"texoma",
"easttexas",
"victoriatx",
"waco",
"wichitafalls",
]

    # URL template for the Craigslist cars & trucks (cta) search in each city
base_url = (
"https://{}.craigslist.org/search/cta?min_auto_year={}#search=1~gallery~0~0"
)
return [base_url.format(city, oldestAllowedCars) for city in cities]


def getAllPosts(browser):
# Create a BeautifulSoup object from the HTML of the page
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Find all of the car listings on the page
return soup.find_all("div", class_="gallery-card")


def getCarInfo(post):
title = post.find("span", class_="label").text

print(f'Scraping "{title}"')

price = post.find("span", class_="priceinfo").text
metadata = post.find("div", class_="meta").text.split("·")

    odometer = metadata[1]
    # Default location so the return below never references an unbound name
    location = metadata[2] if len(metadata) >= 3 else None

link = post.find("a", class_="posting-title", href=True)["href"]

imageElements = post.findAll("img")
images = [img["src"] for img in imageElements]

return title, price, location, odometer, link, images


def processAttributes(attributes):
processedAttributes = []

for attr in attributes:
        # Split only on the first ": " so values containing ": " stay intact
        label, value = attr.split(": ", 1)
processedAttributes.append(
{"label": label.replace(" ", "-").lower(), "value": value}
)

return processedAttributes


def scrapeListing(url):
browser = utils.setupBrowser()

# Navigate to the URL
print(f"Going to {url}")
browser.get(url)

print(f"Loading page for {url}")
time.sleep(1)

# Create a BeautifulSoup object from the HTML of the page
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

try:
description = soup.find("section", id="postingbody").text
attributes = processAttributes(
[
attr.text
for attr in soup.findAll("p", class_="attrgroup")[1].findAll("span")
]
)
map = soup.find("div", id="map")

car = {
"postBody": description,
"longitude": map["data-longitude"],
"latitude": map["data-latitude"],
}

for attr in attributes:
car[attr["label"]] = attr["value"]

return car
    except Exception as e:
        print(f"Failed scraping {url}: \n{e}")
    finally:
        # Close the Selenium WebDriver instance even when scraping succeeds
        browser.quit()
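
For reference, a small sketch of how ``processAttributes`` and ``scrapeListing`` could be exercised directly (run from the ``scrapers/`` directory; the listing URL is a placeholder and the returned keys depend on the listing):

```python
from src import craigslist as cl

# "label: value" strings become lowercase, dash-separated labels
attrs = cl.processAttributes(["fuel: gas", "title status: clean"])
# -> [{"label": "fuel", "value": "gas"}, {"label": "title-status", "value": "clean"}]

# Drives a headless browser and returns a dict with the post body, coordinates, and attributes
car = cl.scrapeListing("https://austin.craigslist.org/cto/d/example-listing/0000000000.html")
print(car)
```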