Commit

Add examples to crawlytics.running_crawls docstring
eliasdabbas committed Jul 15, 2024
1 parent 56d3bdb commit 5fb6a68
Showing 94 changed files with 994 additions and 5,072 deletions.
37 changes: 36 additions & 1 deletion advertools/crawlytics.py
@@ -290,6 +290,7 @@
"jl_to_parquet",
"parquet_columns",
"compare",
"running_crawls",
]


@@ -677,10 +678,44 @@ def running_crawls():
* elapsed: The elapsed time since the spider started.
* %mem: The percentage of memory that this spider is consuming.
* %cpu: The percentage of CPU that this spider is consuming.
- * args: The full command that was used to start this spider. Use this to identify
+ * command: The command that was used to start this spider. Use this to identify
the spider(s) that you want to know about.
* output_file: The path to the output file for each running crawl job.
* crawled_urls: The current number of lines in ``output_file``.
Examples
--------
While a crawl is running:
>>> import advertools as adv
>>> adv.crawlytics.running_crawls()
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
.. pid started elapsed %mem %cpu command output_file crawled_urls
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
0 195720 21:41:14 00:11 1.1 103 /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 cnn.jl 30
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
After a few moments:
>>> adv.crawlytics.running_crawls()
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
.. pid started elapsed %mem %cpu command output_file crawled_urls
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
0 195720 21:41:14 00:27 1.2 96.7 /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 cnn.jl 72
==== ====== ========= ========= ====== ====== ========================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
After starting a new crawl:
>>> adv.crawlytics.running_crawls()
==== ====== ========= ========= ====== ====== ================================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
.. pid started elapsed %mem %cpu command output_file crawled_urls
==== ====== ========= ========= ====== ====== ================================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
0 195720 21:41:14 01:02 1.6 95.7 /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://cnn.com -a allowed_domains=cnn.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200 cnn.jl 154
1 195769 21:42:09 00:07 0.4 83.8 /opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider /opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py -a url_list=https://nytimes.com -a allowed_domains=nytimes.com -a follow_links=True -a exclude_url_params=None -a include_url_params=None -a exclude_url_regex=None -a include_url_regex=None -a css_selectors=None -a xpath_selectors=None -o nyt.jl -s CLOSESPIDER_PAGECOUNT=200 nyt.jl 17
==== ====== ========= ========= ====== ====== ================================================================================================================================================================================================================================================================================================================================================================================================================= ============= ==============
"""
if platform.system() == "Windows":
return "This function does not support Windows yet. Support is coming soon. Sorry!"
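The docstring suggests using the `command` column to identify a spider. As a minimal sketch of how that string could be read programmatically, the helper below splits a command line into its `-a` spider arguments, `-s` settings, and `-o` output file. `parse_spider_command` is a hypothetical name and is not part of advertools; the command string is a shortened version of the one in the docstring example.

```python
import shlex


def parse_spider_command(command):
    """Split a crawl command line into spider arguments, settings, and output file.

    Hypothetical helper for illustration; not part of advertools.
    """
    tokens = shlex.split(command)
    parsed = {"args": {}, "settings": {}, "output_file": None}
    # Walk the tokens pairwise: each flag is followed by its value.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-a", "-s"):
            # Split on the first "=" only, so URLs keep any later "=" characters.
            key, _, val = value.partition("=")
            target = "args" if flag == "-a" else "settings"
            parsed[target][key] = val
        elif flag == "-o":
            parsed["output_file"] = value
    return parsed


# Shortened from the `command` value shown in the docstring example.
command = (
    "/opt/tljh/user/bin/python /opt/tljh/user/bin/scrapy runspider "
    "/opt/tljh/user/lib/python3.10/site-packages/advertools/spider.py "
    "-a url_list=https://cnn.com -a allowed_domains=cnn.com "
    "-a follow_links=True -o cnn.jl -s CLOSESPIDER_PAGECOUNT=200"
)
parsed = parse_spider_command(command)
print(parsed["args"]["url_list"])
print(parsed["output_file"])
```

With several crawls running, the parsed `url_list` or `output_file` may be a quicker way to tell the rows apart than reading the full command.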
6 changes: 3 additions & 3 deletions advertools/image_spider.py
@@ -40,15 +40,15 @@
==== ============================================================================================== ==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
.. image_location image_urls
==== ============================================================================================== ==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
- 0 https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/user_images/6r1oxXOpC_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
+ 0 https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/user_images/6r1oxXOpC_large.jpg?downsize=120:&output-format=jpg&output-quality=auto
0 https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2024-03/18/16/asset/fce856744ed8/sub-buzz-1303-1710779249-1.jpg
0 https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
0 https://www.buzzfeed.com/hannahdobro/dirty-little-industry-secrets?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2024-03/18/16/asset/245ecfa321e9/sub-buzz-894-1710779358-1.jpg
- 1 https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2017-12/12/13/user_images/buzzfeed-prod-web-03/chelseastewart-v2-5590-1513102854-0_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
+ 1 https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2017-12/12/13/user_images/buzzfeed-prod-web-03/chelseastewart-v2-5590-1513102854-0_large.jpg?downsize=120:&output-format=jpg&output-quality=auto
1 https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2024-03/21/19/asset/ea6298160040/sub-buzz-1093-1711048323-1.jpg?downsize=700%3A%2A&output-quality=auto&output-format=auto
1 https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
1 https://www.buzzfeed.com/chelseastewart/josh-peck-statement-drake-bell-abuse-claims?origin=tuh data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFQAAAA7CAMAAADSF118AAAAP1BMVEUAAADIGxPOHBK5EwDFGhi5Fwi8GRTEGhe7EQDMHR7////vyMfddnm5Hx334+Py8fHdj5DLVVXnq6zJOTzVbG1s8SkwAAAACXRSTlMAv4Eo10JnqA8IHfydAAABJUlEQVRYw93Y64rCMBCG4czk5FSzdav3f63bDaxfV4Qm+AXR96/wMNj0kLhtPib9LcutYA8K+F1rKXqH4KmIPZVIOvwnszEqumFjMVLB3+YsRiv8zRqMWHa1ZNQiBuUV3Jo3cn5FlY3qimY2KitajB3+UmLRxRGovgmqTj4HXc69aN5Hj9PcyYqzfXSavk58tJMNTWgv24pW9kpE0fGbioKlomCZKNgLEUXLhYiiMx+dT+xJ8SxgoCDZ6EJcp7jsPBQLlIbiVmpEwy7aS1poeZ30PvqlAQVJRGeQtLfp1dBLPyb0bdDER+OYL2nHR7E34yUjtjw6ZMc3am/KXlSpoodCHrQWiWbxI85Q6Kc9pneHSCmHJ0VJGPPuAC3LWqO/OURL0aEfg76m8Izrt6EAAAAASUVORK5CYII=
- 2 https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2021-06/3/16/user_images/a824550933a9/tomiobaro-v2-2174-1622738336-41_large.jpg?downsize=120:*&output-format=jpg&output-quality=auto
+ 2 https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2021-06/3/16/user_images/a824550933a9/tomiobaro-v2-2174-1622738336-41_large.jpg?downsize=120:&output-format=jpg&output-quality=auto
2 https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2024-03/19/13/asset/6634db63f453/sub-buzz-576-1710855734-6.jpg?downsize=700%3A%2A&output-quality=auto&output-format=auto
2 https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh https://img.buzzfeed.com/buzzfeed-static/static/2024-03/19/13/asset/cb8db05df7e7/sub-buzz-1743-1710855790-4.jpg
2 https://www.buzzfeed.com/josephlongo/celebs-wearing-rewearing-same-dress?origin=tuh data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
3 changes: 2 additions & 1 deletion advertools/sitemaps.py
@@ -412,6 +412,7 @@
.. code-block::
:class: thebe, thebe-init
adv.sitemap_to_df("https://www.ft.com/sitemaps/news.xml", request_headers={"User-agent": "YOUR-USER-AGENT"})
Another interesting thing you might want to do is utilize the `If-None-Match` header.
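For context, `If-None-Match` is the standard HTTP header for conditional requests: the client sends an ETag it saved earlier, and the server can reply `304 Not Modified` when the resource has not changed. The sketch below only builds the headers dictionary. `conditional_headers` and the ETag value are hypothetical, and how `sitemap_to_df` handles a `304` response is not shown in this diff.

```python
def conditional_headers(user_agent, etag=None):
    """Build request headers, adding If-None-Match when an ETag is known.

    Hypothetical helper for illustration; not part of advertools.
    """
    headers = {"User-agent": user_agent}
    if etag:
        # The server can answer 304 Not Modified if its current ETag matches.
        headers["If-None-Match"] = etag
    return headers


# Hypothetical ETag, as it might have been saved from an earlier response.
saved_etag = '"abc123"'
headers = conditional_headers("YOUR-USER-AGENT", etag=saved_etag)
print(headers)

# Assumed usage, based on the `request_headers` parameter in the signature
# of sitemap_to_df shown in this diff:
# adv.sitemap_to_df("https://www.ft.com/sitemaps/news.xml", request_headers=headers)
```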
@@ -524,7 +525,7 @@ def sitemap_to_df(sitemap_url, max_workers=8, recursive=True, request_headers=No
case you want to explore what sitemaps are available
after which you can decide which ones you are
interested in.
- :param dict request_headers: One or more request headers to use while
+ :param dict request_headers: One or more request headers to use while
fetching the sitemap.
:return sitemap_df: A pandas DataFrame containing all URLs, as well as
other tags if available (``lastmod``, ``changefreq``,
Binary file modified docs/_build/doctrees/advertools.ad_create.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.ad_from_string.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.crawlytics.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.emoji.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.extract.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.header_spider.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.image_spider.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.knowledge_graph.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.kw_generate.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.logs.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.reverse_dns_lookup.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.robotstxt.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.serp.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.sitemaps.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.spider.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.twitter.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.url_builders.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.urlytics.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.word_frequency.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.word_tokenize.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/advertools.youtube.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/include_changelog.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
- config: 6d44d7111b7f1716309db46f215f3039
+ config: 9050954f32c83d4b0f75fbb50189c8b7
tags: 645f666f9bcd5a90fca523b33c5a78b7
6 changes: 3 additions & 3 deletions docs/_build/html/_modules/advertools/ad_create.html
@@ -15,8 +15,8 @@

<script src="../../_static/jquery.js?v=5d32c60e"></script>
<script src="../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
- <script src="../../_static/documentation_options.js?v=0dc20415"></script>
- <script src="../../_static/doctools.js?v=888ff710"></script>
+ <script src="../../_static/documentation_options.js?v=ef97403f"></script>
+ <script src="../../_static/doctools.js?v=9a2dae69"></script>
<script src="../../_static/sphinx_highlight.js?v=dc90522c"></script>
<script>const THEBE_JS_URL = "https://unpkg.com/[email protected]/lib/index.js"; const thebe_selector = ".thebe,.cell"; const thebe_selector_input = "pre"; const thebe_selector_output = ".output, .cell_output"</script>
<script async="async" src="../../_static/sphinx-thebe.js?v=c100c467"></script>
@@ -39,7 +39,7 @@
advertools
</a>
<div class="version">
- 0.14.2
+ 0.15.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
6 changes: 3 additions & 3 deletions docs/_build/html/_modules/advertools/ad_from_string.html
@@ -15,8 +15,8 @@

<script src="../../_static/jquery.js?v=5d32c60e"></script>
<script src="../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
- <script src="../../_static/documentation_options.js?v=0dc20415"></script>
- <script src="../../_static/doctools.js?v=888ff710"></script>
+ <script src="../../_static/documentation_options.js?v=ef97403f"></script>
+ <script src="../../_static/doctools.js?v=9a2dae69"></script>
<script src="../../_static/sphinx_highlight.js?v=dc90522c"></script>
<script>const THEBE_JS_URL = "https://unpkg.com/[email protected]/lib/index.js"; const thebe_selector = ".thebe,.cell"; const thebe_selector_input = "pre"; const thebe_selector_output = ".output, .cell_output"</script>
<script async="async" src="../../_static/sphinx-thebe.js?v=c100c467"></script>
@@ -39,7 +39,7 @@
advertools
</a>
<div class="version">
- 0.14.2
+ 0.15.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">