
Scrapy source using scrapy #332

Merged
merged 107 commits into master from source/scaping on Mar 5, 2024
Conversation

@sultaniman (Contributor) commented Jan 25, 2024

Hey,

This PR implements a Scrapy source that sets up communication between threads via a queue; see the diagram below.

[diagram: scrapy]
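Roughly, the hand-off looks like the sketch below (simplified, not this PR's actual code; run_scrapy, the sentinel, and the queue size are illustrative stand-ins): scrapy pushes items onto a queue from its own thread, and a dlt resource drains it.

```python
import threading
from queue import Queue

import dlt

result_queue: Queue = Queue(maxsize=1000)
DONE = object()  # sentinel marking the end of the crawl

def run_scrapy() -> None:
    # Stand-in for the scrapy crawl running in its own thread: scraped
    # items are put on the queue; the sentinel signals completion.
    for i in range(10):
        result_queue.put({"id": i, "url": f"https://example.com/{i}"})
    result_queue.put(DONE)

@dlt.resource(name="scraped_items")
def queue_reader():
    # The dlt resource drains the queue on the consuming thread until done.
    while True:
        item = result_queue.get()
        if item is DONE:
            break
        yield item

threading.Thread(target=run_scrapy, daemon=True).start()
dlt.pipeline(pipeline_name="scraping", destination="duckdb").run(queue_reader())
```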

TODO

  • Write README and manual,
  • Demo with real scraping,
  • Another demo with more complex scraping,
  • Demo interaction with the generated pipeline resource, e.g. using add_limit(...) (see the sketch after this list),
  • Tests.
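For the add_limit(...) item above, here is a minimal sketch of how the generated resource composes with dlt's standard resource API (the resource body is a stub; in the real source, items come from the scrapy queue):

```python
import dlt

@dlt.resource(name="scraped_items", write_disposition="append")
def scraping_resource():
    # Stub standing in for the dynamically created scraping resource;
    # the real one yields items arriving from the scrapy queue.
    for i in range(100):
        yield {"id": i, "url": f"https://example.com/{i}"}

pipeline = dlt.pipeline(
    pipeline_name="scraping_demo",
    destination="duckdb",
    dataset_name="scraped",
)
# add_limit(...) caps how many items the resource yields - handy for demos.
info = pipeline.run(scraping_resource().add_limit(10))
print(info)
```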

Based on the feedback from @rudolfix, we decided to make the following changes:

  • Hide the queue,
  • Remove source and resource in favor of dynamic creation,
  • Higher-level run_scrapy_pipeline -> init: simply run the pipeline and exit,
  • Higher-level run_scrapy_pipeline to run parallel crawls,
  • Make settings a parameter with sensible defaults,
  • Remove the pipeline item and use signals instead (see the sketch below),
  • Close the queue in the pipeline runner.
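A rough sketch of the signals-based wiring we converged on (assumptions: the DONE sentinel stands in for this PR's actual queue close, and DemoSpider is just any standard spider):

```python
from queue import Queue

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

result_queue: Queue = Queue()
DONE = object()  # stands in for the PR's queue close

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

def on_item_scraped(item, response, spider):
    # Replaces the custom item pipeline: a signal handler pushes every
    # scraped item onto the shared queue.
    result_queue.put(dict(item))

def on_engine_stopped():
    # The pipeline runner closes the queue so the consuming side stops.
    result_queue.put(DONE)

process = CrawlerProcess(settings={"TELNETCONSOLE_ENABLED": False})
crawler = process.create_crawler(DemoSpider)
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
crawler.signals.connect(on_engine_stopped, signal=signals.engine_stopped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes
```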

@sultaniman self-assigned this Jan 25, 2024
@sultaniman mentioned this pull request Jan 25, 2024
@sultaniman force-pushed the source/scaping branch 3 times, most recently from 860d39f to 4191967 on January 26, 2024 15:09
@rudolfix (Contributor) left a comment:

Yeah, now that queue closing is implemented, you can simplify the code even further. Please read the review. When it is done you'll be able to get rid of DltSpider, and people will be able to use a standard spider here.

Review threads (outdated, resolved): sources/scraping/helpers.py (×5), sources/scraping/queue.py (×2), sources/scraping/scrapy/pipeline_item.py, sources/scraping/scrapy/spider.py, sources/scraping_pipeline.py
@sultaniman linked an issue on Feb 1, 2024 that may be closed by this pull request
@rudolfix (Contributor) left a comment:

Looks good, but we still need to simplify it. I'll ping you for a code walkthrough.

Review threads (resolved): sources/scraping/__init__.py (×2), sources/scraping/scrapy/pipeline_item.py, sources/scraping/helpers.py
Sultan Iman added 10 commits February 26, 2024 13:32
Close queue

Add requirements.txt

Format code

Cleanup code

Fix linting issues

Remove redundant config option

Add revised README

Add more docstring and cleanup sources

Make api simpler

Cleanup
Add logging and batch size configuration

Cleanup code
Add tests

Adjust assert

Update README

Add missing diagram file

Update README

Close queue when exiting

Check if queue close is called

Log number of batches

Fix linting issues

Fix linting issues

Mark scrapy source

Fix linting issue

Format code

Yield!
@rudolfix (Contributor) left a comment:

All looks good now, but the tests somehow do not end. Most probably a thread that is not a daemon thread is still executing. It all works every time I run the tests locally, also when I mix them with other tests :/
My take:

  • try to stop the reactor after the test (see the sketch below),
  • if that does not work, then we skip all reactor tests.

This is a really good job! It's just that the way scrapy works is a little bit obnoxious.
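A minimal sketch of the first option, assuming a module-scoped pytest fixture and that the reactor may still be running at teardown:

```python
import pytest
from twisted.internet import reactor

@pytest.fixture(scope="module", autouse=True)
def shutdown_reactor():
    yield  # run all tests in the module first
    # reactor.stop() must run on the reactor thread, hence callFromThread
    # when the tests drive it from another thread.
    if reactor.running:
        reactor.callFromThread(reactor.stop)
```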

@sultaniman requested a review from rudolfix on March 5, 2024 09:57
@sultaniman (Author) commented:

@rudolfix I decided to mark the test as skipped, since we have other, more pressing issues, and left a TODO item to spend 10% research time on this.

I tried to reproduce this locally by

  1. running the scraping tests along with other tests,
  2. running the scraping test suite alone,
  3. running the particular test that hangs,
  4. swapping out the queue with a custom test queue which terminates after N retries (see the sketch after this list).
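For step 4, a hypothetical sketch of such a test double (SelfClosingQueue and get_with_retries are illustrative, test-only names; QueueClosedError mirrors the error this PR introduces):

```python
import queue

class QueueClosedError(Exception):
    """Mirrors the error the source raises when the queue is closed."""

class SelfClosingQueue(queue.Queue):
    # Test-only double: get_with_retries() raises QueueClosedError after
    # `max_empty_reads` consecutive timeouts, so a hung consumer loop
    # is guaranteed to terminate.
    def __init__(self, max_empty_reads: int = 5, read_timeout: float = 1.0):
        super().__init__()
        self.max_empty_reads = max_empty_reads
        self.read_timeout = read_timeout

    def get_with_retries(self):
        empty_reads = 0
        while True:
            try:
                return self.get(timeout=self.read_timeout)
            except queue.Empty:
                empty_reads += 1
                if empty_reads >= self.max_empty_reads:
                    raise QueueClosedError("no items after repeated reads")
```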

@sultaniman (Author) commented:

Created a bug ticket: #394

@rudolfix (Contributor) left a comment:

LGTM!

@rudolfix merged commit 53bcbaa into master on Mar 5, 2024
14 checks passed
@rudolfix deleted the source/scaping branch on March 5, 2024 16:00
rudolfix pushed a commit that referenced this pull request Mar 6, 2024
Close queue

Add requirements.txt

Remove redundant config option

Add revised README

Make api simpler

* Add batching of results

Add logging and batch size configuration


* Add pytest-mock and scrapy

Close queue when exiting

Check if queue close is called

Log number of batches

Fix linting issues

Fix linting issues

Mark scrapy source

Fix linting issue

Format code

Yield!

* Adjust tests

* Add pytest-twisted

* Add twisted to scrapy dependencies

* Add twisted to dev dependencies

* Add review comments

* Add more checks and do not exit when queue is empty

* Create QueueClosedError and handle in listener to exit loop

* Simplify code

* Stop crawling if queue is closed

* Fix linting issues

* Fix linting issues

* Adjust tests and disable telnet server for scrapy

* Remove pytest-twisted

* Refactor scrapy item pipeline

* Eliminate custom spider

* Use pytest.mark.forked to run tests for ALL_DESTINATIONS

* Add pytest-forked

* Update lockfile

* Use scrapy signals

* Hide batching and retrieving logic inside queue

* Add more types

* Extend default scrapy settings

* Extract pipeline and scrapy runners

* Simplify helpers code

* Cleanup code

* Add start_urls_file configuration option

* Sync scrapy log level with dlt log level

* Expose simple scraping pipeline runner

* Adjust config file

* Connect signals in ScrapyRunner.init

* Register source and do cleanups

* Better scrapy setting passing and minor cleanups

* Remove redundant code comments

* Call engine_stopped callback in finally block

* Add more docstrings related to runners

* Adjust batch size

* Fix queue batching bugs

* Pass crawler instance to item_scraped callback

* Add advanced example to pipeline code

* Access settings override for scrapy

* Rewrite tests

* Small readme update for Bing Webmaster

* Adjust queue read timeout

* Extract test utils for scraping source

* Add stream generator to queue to handle generator exit exception

* Extract signal registering and tearing down as context manager

* Adjust and cleanup example pipeline source file

* Cleanup scraping helpers

* Adjust tests for scraping pipeline

* Add callback access to scraping resource

* Update readme

* Cleanup code

* Import ParamSpec from typing extensions

* Fix linting issues

* Fix linting issues

* Set encoding when opening the file with urls

* Adjust typing for scraping testing utils

* Use proper Union syntax

* Adjust mock patch module path for scraping tests

* Use latest dlt version

* Adjust mock patch module path for scraping tests

* Adjust tests and mark ones to skip

* Cleanup tests and utils for scraping source

* Re-use spy on queue.close calls

* Use append write_disposition by default for scraping source

* Update test skip reason

* Stop crawler manually

* Return self from __call__

* Check if crawler.stop is actually called

* Check if crawling has already been stopped

* Test to verify resource name generation and override

* Adjust resource name selection

* Add more docstrings and update readme

* Update readme

* Add scrapy configuration in example pipeline

* Shutdown twisted reactor after module tests

* Use simple run_pipeline

* Close the queue after timeout

* Rewrite a comment and use break instead of return in while loop

* Update comments

* Mock queue with alternative implementation

* Adjust mock patch path

* Add logging when scrapy stops and re-arrange code actions

* Stop crawler in on_engine_stopped

* Call on_engine_stopped from on_item_scraped if the queue is closed

* Skip test
rudolfix added a commit that referenced this pull request Mar 22, 2024
* handles end_value and row_order in sql_database

* drops dlt extract dependencies

* test(filesystem): enable testing (#348)

* enable testing

* bump dlt version

* pushes correct gdrive path

* fixes formatting, removes az logs

---------

Co-authored-by: Marcin Rudolf <[email protected]>

* Update pull_request_template.md (#373)

* fixes kinesis shard test (#390)

* Fix: skip tests for sources without test accounts (#379)

* skip facebook tests

* skip matomo tests

* skip personio tests

* replace postgres to duckdb

* return kafka, skip strapi

* Scrapy source using scrapy (#332)

* rename template buttons (#340)

* Rename new-verified-source.md to build-new-verified-source.md

* Rename source-request.md to request-new-source.md

* Update request-new-source.md

* fixes references

---------

Co-authored-by: Marcin Rudolf <[email protected]>

* bumps dlt to 0.4.6

* fixes linter

* fixes the empty primary key disabling incremental dedup

* bumps to 0.4.7a0, does not overwrite [] as primary key

* bumps dlt to 0.4.7

---------

Co-authored-by: Ilya Gurov <[email protected]>
Co-authored-by: Anton Burnashev <[email protected]>
Co-authored-by: Alena Astrakhantseva <[email protected]>
Co-authored-by: Sultan Iman <[email protected]>
Co-authored-by: adrianbr <[email protected]>
Development

Successfully merging this pull request may close these issues:

[web scraper] verified source