
Luis/toy webscrapper #68

Merged: 20 commits merged into master on May 17, 2021

Conversation

@LDiazN (Collaborator) commented Apr 19, 2021

Important Changes

  • Added scrapy as a dependency
  • Added beautifulsoup4 as a dependency
  • Added a scraper module in c4v-py/src

Problem

Create a toy scraper that gets basic data from El Pitazo and can be expanded later with implementations for other sources.

Proposed solution

  • Expose a public scrape(url: str) -> ScrapedData function that receives a URL and scrapes data from it if possible, raising an error otherwise.
  • Create a spiders.py file with implementations of scrapers for different sources (there's currently just one, for El Pitazo).
  • Create a settings.py file with mappings from domains to spiders, so we can choose which implementation handles each domain, deferring domain-level logic to that spider (see the sketch below).
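A rough sketch of how these pieces could fit together, based on the names that appear in this PR (URL_TO_SCRAPER, scrape, ElPitazoScraper); the spider body below is a stand-in, not the real implementation:

```python
class ElPitazoScraper:
    """Stand-in for the El Pitazo spider implemented in spiders.py."""

    def scrape(self, url: str) -> dict:
        # the real spider crawls the page and returns fields such as
        # title, body, tags, and categories
        return {"url": url}


# settings.py-style mapping from known sources to the scraper that handles them
URL_TO_SCRAPER = {"elpitazo.net": ElPitazoScraper}


def scrape(url: str) -> dict:
    """Scrape `url` with the scraper registered for its source,
    raising an error when no known source matches."""
    for known_url, scraper_cls in URL_TO_SCRAPER.items():
        if known_url in url:
            return scraper_cls().scrape(url)
    raise ValueError(f"No scraper available for url: {url}")
```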

Tasks

  • Create a scraper spider for El Pitazo, working only for article URLs

@marianelamin (Collaborator)

no need to push .lock file, since it can be generated with poetry install or update.


_process = CrawlerProcess(CRAWLER_SETTINGS)

def scrape(url : str) -> List[dict]:
Collaborator

Could you create unit tests for the scrape function?

Collaborator

From the design docs, we are aiming for: given a URL, return the scraped data as a str. A Dict would make more sense so we are able to scale, but I don't see why we would have a list of dictionaries from just one URL.

Collaborator Author

I want to, I just don't know how yet, since I need to find a way to test a web scraper while decoupling the tests from the HTTP request to the page being scraped. I have a question: where should I put the tests once they've been written? With the nox tests, or in their own file/folder within the scraper module?

Collaborator

We have been adding our tests (not necessarily unit tests) under the tests/data folder, since our existing modules are in src/data, mirroring the src folder structure.

Collaborator

Put the tests in their own folder for now: since the scraper is in /src/bla-bla-scraper, put the tests in a tests folder at the top level (e.g. /tests/bla-bla-specs). This way, our tests use our library instead of being part of the library.

My two cents regarding testing. You do not test the scraper (Scrapy already does that), and you do not test that the webpage works—because websites are fragile and monitoring that scrapers work and alerting somebody is not part of the scope of this phase.

You should test the expected behavior of all the code that you wrote. By thinking hard about testing the implemented code, you decouple external modules from the core ones. For example, by decoupling the HTTP adapter and the Scraping Engine from our scraping code itself. Remember that great testability is a symptom of a great design.

That way, we can mock parts of the Scrapy framework, HTTP responses, or any outsider code. Only if it's needed that is :)

However, this is hard to do! Try not to get frustrated if you are struggling. It's normal.
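For what it's worth, one common way to decouple tests from live HTTP requests in Scrapy is to build a response from a saved HTML file; a sketch of such a helper (the helper actually used in this PR may differ in details):

```python
from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_path: str, url: str) -> HtmlResponse:
    """Build an offline Scrapy response from a saved HTML page so that a
    spider's parse() can be exercised without hitting the real website."""
    with open(file_path, "rb") as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")
```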

Collaborator

+1 for the tests, I would also suggest that instead of writing the tests in src/scraper/tests, you write them in /tests

@asciidiego (Collaborator)

> no need to push .lock file, since it can be generated with poetry install or update.

I think we should (purposely) use lock files. After all, they represent the latest successful build of dependencies of the project. The lock file is safer since it is more specific than just a version number and, therefore, we should always commit it.

@asciidiego (Collaborator)

Just a reminder: whenever the PR is ready for review, remove its draft status, and (I think) it will appear on my GitHub notification dashboard.

@VKorelsky (Collaborator)

> no need to push .lock file, since it can be generated with poetry install or update.

> I think we should (purposely) use lock files. After all, they represent the latest successful build of dependencies of the project. The lock file is safer since it is more specific than just a version number and, therefore, we should always commit it.

Agree with this, lock files are usually committed :)

scraper = ElPitazoScraper()
parse_output = scraper.parse(response)

# body = ????? TODO still don't know how to test body equality
Collaborator

I think that for the body test, you'll have to copy a super long string and use that. Different ways to make the test a bit nicer to read in this case are:

  • define a method in the same test file that returns the expected body as a string, so that you can just write:
    assert parse_output["body"] == get_expected_body()
    ...
    def get_expected_body():
        return "super long string that you expect to be scraped from the website....."
  • define a method as above, but put the expected body in a separate text file and read from it:
    assert parse_output["body"] == get_expected_body()
    ...
    def get_expected_body():
        # <code that opens a text file and reads it as a string>

I would suggest the first approach.
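If the file-based option is ever preferred, a minimal sketch of that helper could look like this (the resource path and filename are hypothetical):

```python
from pathlib import Path

# hypothetical resource path; the file would hold the expected article body
EXPECTED_BODY_FILE = Path(__file__).parent / "resources" / "expected_body.txt"


def get_expected_body() -> str:
    # read the expected article body stored next to the tests
    return EXPECTED_BODY_FILE.read_text(encoding="utf-8")
```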

def test_title_parse_ok():
    url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html"
    test_file = os.path.join(ROOT_DIR, url)
    response = fake_response_from_file(test_file, "https://elpitazo.net/cronicas/la-fallas-electricas-han-disminuido-la-cantidad-de-carne-que-consume-el-venezolano/")
Collaborator

nice 👍🏽

Collaborator

As I said to Luis, I would not even test that the website is correctly parsed. Why? Website layouts are fragile, and fragile tests are bad tests.

There needs to be a separation of concern: one thing is parsing correctly, and another one is using the parser correctly. We need to test the latter because that's the purpose of a spider—among other things. Creating these kinds of tests for one spider is simple, but as websites are added and they change, it will be a hell to maintain.

Instead, we should (eventually) prepare the system for monitoring and alerts, but I would not worry about that—for now.

Collaborator

> Why? Website layouts are fragile, and fragile tests are bad tests.

In my opinion, if the code is expected to go read a webpage and extract information from it, there should be at least a test asserting that if the website looks one way, then X information is extracted from it. I'm not sure how else you would test that your scraper works correctly on one version of the intended website. When the website changes layout, we can update the test HTML we've saved for that version.

If way more websites are added and this becomes complex, then we can refactor our approach.

What other approaches do you suggest for testing this?

@VKorelsky (Collaborator) Apr 22, 2021

> and another one is using the parser correctly. We need to test the latter because that's the purpose of a spider—among other things.

Not sure I get what you mean by using the parser correctly. How do you suggest this is tested?

@asciidiego (Collaborator) Apr 22, 2021

Avoiding testing fragile code.

Just like you would (usually) not test a UI. Integration and E2E tests are expensive and a hell to maintain because they are fragile. I would apply the Humble Object pattern. Make the parser humble: its only job is to execute a parsing function (that has already been tested, e.g. beautifulsoup, lxml, ...). We make sure that the parser works (pass simple HTML strings with different selectors) and then move on.

The interesting logic we want to check is not whether something has been correctly configured or not. We want to assert that our program can configure itself based on the inputs (now that is easy to test). For example, URL starts with pitazo? Pitazo scraper it is. The Pitazo scraper does not return anything because the selectors are wrong? Well, that's the problem of the configuration, not the class. The class works. Its behavior is correct. No need for file fixtures. No need to update both code and tests every time a single div changes or when a CSS/XPath selector is not general enough.

When the website changes its layout (how are you going to know that it changed?), our tests must not break. An alert system should. We should not couple a whole Python library to a website CSS and HTML. Let alone multiple.
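For illustration, here is a self-contained sketch of the humble-parser idea being described; the class, CSS selector, and HTML string are made up for the example, not taken from this PR:

```python
from bs4 import BeautifulSoup


class HumbleParser:
    """Its only job is to run an already-tested parsing function."""

    def extract_title(self, html: str) -> str:
        tag = BeautifulSoup(html, "html.parser").select_one("h1.title")
        return tag.get_text(strip=True) if tag else ""


def test_humble_parser_extracts_title():
    html = "<html><body><h1 class='title'>Hola</h1></body></html>"
    assert HumbleParser().extract_title(html) == "Hola"
```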

@asciidiego (Collaborator) Apr 22, 2021

> In my opinion, if the code is expected to go read a webpage and extract information from it [...]

The code is expected to configure a spider appropriately based on the URL. It might be expected to clean the output a bit assuming a correct parsing. It is expected to do something when no input was found (because of connection errors or simple bad CSS selectors or layout changes). For example, just write an empty string to a Dict.

But, I do not expect the code to extract information from a webpage with 100% accuracy just like I do not expect an NLP algorithm to classify text perfectly. That's why we should validate our scraping engine with a monitoring system just like a Machine Learning Pipeline is constantly being evaluated and re-trained. We can only test deterministic algorithms. Scraping something that is not under our control is not one of those.

Caveat: the main reason why I would not test fragile things (configuration and layout) is the hell that it is to maintain, even for a single spider. However, there have been attempts to fix this issue, even by Scrapy maintainers themselves. But, in my opinion, it's an uphill battle. There are alternatives, but it looks like the cure is worse than the disease [1][2].

[1] https://github.com/scrapinghub/scrapy-autounit#caveats
[2] https://docs.scrapy.org/en/latest/topics/contracts.html

@VKorelsky (Collaborator) Apr 28, 2021

> When the website changes its layout (how are you going to know that it changed?), our tests must not break.

Our tests won't break, since that one sample page is stored locally as an HTML file in the project.
In the case of a mostly static website, where most pages have the same layout (chatting with Luis, it sounds like El Pitazo is mostly in this category), I think the value of the integration test will outweigh its maintenance burden (which should be very small).

If that website changes (and all pages adopt a new layout), then we'll have to update our parser anyway. (I'm with you on the fact that an alerting system should warn us of this, not our tests.) Once we're aware the site changed, we'd take a snapshot of what the new layout (hopefully still static) looks like for an article page, and update our test with it.

Of course, in the case of a more complicated website, or one where the setup is more complex, the maintenance overhead increases a lot and the overall value therefore decreases. In that case I agree with you that we shouldn't aim to write such tests.

Now, there may be some Scrapy specifics that make this impractical; about those I'm all ears :)

My overall impression is simply that for a website whose pages are static and follow the same layout, there is more value than cost in this kind of snapshot-based integration test.

from typing import List


class SpiderManager:
Collaborator

this class will need some unit tests too :)

Collaborator

I'll have a look at how to test this, given that it has the CrawlerProcess + spider things.
@m4r4c00ch0 maybe you would have some input here?

Collaborator

What's exactly the purpose of this class? I don't like the word Manager because it looks like this module might become a God Object. But, I might be wrong. Is this a Mediator?

p.s. I am used to writing tests before programming, so this might need refactoring before we can test it easily. I would give you my advice on how to unit test this class, but first I need to know its purpose.

Collaborator Author

Its purpose is to implement the scraping and parsing functions for spiders, so they are not directly managed by a BaseScraper subclass. This way, BaseScrapyScraper can just delegate that behavior to the manager object. Maybe it would be better to merge that logic into BaseScrapyScraper, since the latter is just redirecting behavior to SpiderManager?
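For context, a simplified sketch of the delegation being described; the structure follows the classes in this PR, while the method bodies are assumptions:

```python
class SpiderManager:
    """Wraps a scrapy spider and runs its scraping/parsing logic."""

    def __init__(self, spider) -> None:
        self.spider = spider

    def parse(self, response) -> dict:
        # run the wrapped spider's parse logic and return its output
        return self.spider.parse(response)


class BaseScrapyScraper:
    """Scraper backed by a scrapy spider; it only forwards to the manager."""

    def __init__(self, spider) -> None:
        self._spider_manager = SpiderManager(spider)

    def parse(self, response) -> dict:
        return self._spider_manager.parse(response)
```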

from typing import List


class BaseScraper:
Collaborator

👍🏽

Comment on lines 70 to 72
for known_url in URL_TO_SCRAPER.keys():
    if known_url in url:
        return URL_TO_SCRAPER[known_url]
Collaborator

here for efficiency I would suggest (see the sketch after this list):

  • validating that the URL is valid
  • extracting the domain name from the URL
  • using the domain name as the lookup key in the map
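A minimal sketch of that lookup using only the standard library (the helper name is made up):

```python
from urllib.parse import urlparse


def get_domain(url: str) -> str:
    """Minimally validate the URL and return its domain for the map lookup."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")
    return parsed.netloc.lower()


# usage: scraper_cls = URL_TO_SCRAPER[get_domain(url)]
```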

Collaborator Author

In order to validate the URL, the two easiest options I found are:

  • Using a regex like this from django
  • Using a validators library like this

Which approach do you think is best? I would prefer the first one since it does not require a new dependency.

from .settings import URL_TO_SCRAPER


def scrape(url: str) -> dict:
Collaborator

these methods will need some unit tests as well :)

from bs4 import BeautifulSoup


def clean(element: str) -> str:
Collaborator

Suggested change
def clean(element: str) -> str:
def strip_html_tags(element: str) -> str:

better to go with a name that matches exactly what this does, clean seems too broad in my opinion.
What do you think?

Collaborator Author

You're totally right, there's no good reason to leave it as just "clean".

Collaborator

advice: after you write tests for this function/module, you'll (probably) have a better guess of what the best name would be.

@@ -0,0 +1,42 @@
"""
Misc. helper functions.
Collaborator

also, tests for this :)


def __init__(self, spider) -> None:
    self.spider = spider
    pass
Collaborator

why is there a pass here?

Collaborator Author

Oh, it shouldn't be there, I just forgot to delete it. Thanks for noticing.


        self._spider_manager = SpiderManager(self.spider)

    def parse(self, response) -> dict:
        # Delegate parse function to scrapy implementation
Collaborator

I would remove these redundant comments. I can already see that the function is delegating its scrape implementation to the manager.

…d function to get domain from url; optimized map from url to scraper implementation
@Edilmo (Member) left a comment

LGTM

Nice work.
Whenever @VKorelsky and/or @m4r4c00ch0 think it is OK, go ahead.

My comments could be addressed in separate tasks/PRs, up to you.

@@ -0,0 +1,80 @@
"""
Member

The whole package is misplaced.
By convention, we would like everything inside our c4v-py project to be imported like this:

import c4v.scraper.X
...
from c4v.scraper import X

whereas with the current layout you would have:

import scraper.X
...
from scraper import X

The former is the pythonic way to organize things; the latter would be a library with more than one global package, which is extremely rare (at least for me).

I recommend sticking to the normal case.


# Python imports
from typing import List, Type

Member

I know that this is the first step, and because of that it is OK for me if the following observation is handled in a follow-up task/PR.

Currently we are returning a Dict, but I encourage you to start working with explicitly typed data structures as soon as possible.
More precisely, I would like to see a hierarchy of DataClasses representing outputs. This hierarchy should, to some extent, be aligned with the hierarchy of classes you already put in place in this PR.
And, if I can be more strict and picky, I would like to see the current hierarchy of classes defined as Generics parameterized by the output DataClasses.
Another important observation is that those DataClasses should be seen as immutable objects. They could have methods performing computations, but they should avoid mutating the object itself. That design should make everything friendlier to task pipelining and future parallelization.
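One possible shape for such an immutable output, sketched as a frozen dataclass; the field list is a guess based on what the El Pitazo spider currently returns (title, body, tags, categories):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ScrapedData:
    title: str
    body: str
    tags: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
```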

Base class for scrapers implementations
"""

def parse(self, response) -> dict:
Member

NIT: I noticed that you are following the typing, which is great.
If we want to be strict, the typing here has two mistakes:

  1. response has no type: maybe that is because Scrapy does not have type hints. Is that right?
  2. dict is not a valid type hint. The typing module defines a type called Dict. That one is generic, and a typical annotation example looks like this:
def parse(self, response: MyType) -> Dict[str, Any]:

Collaborator Author

About 1:
No, Scrapy does have type hints; we decided to keep the response as an "any" type so every scraper could have its own response type according to its needs. But I agree that we should have a base class providing a minimal interface for every response object, and (using Scrapy as an example) a ScrapyResponse implementation for Scrapy's Response object could just delegate the interface implementation to Scrapy, for example:

class BaseResponse:
    def url(self) -> str:  # just an example
        pass

class ScrapyResponse(BaseResponse):
    def __init__(self, scrapy_response: scrapy.http.Response):
        self.scrapy_response = scrapy_response

    def url(self) -> str:
        return self.scrapy_response.url

I'll be addressing this issue in another PR.
About 2:
You're right, I should change every dict signature. I wasn't sure which one to use when I first wrote it. I'll change it ASAP.
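For reference, the stricter annotation could look something like this, keeping response as Any until a base response type exists:

```python
from typing import Any, Dict


class BaseScraper:
    def parse(self, response: Any) -> Dict[str, Any]:
        raise NotImplementedError
```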

"""

# Settings passed to the crawler
CRAWLER_SETTINGS = {"LOG_ENABLED": False}
Member

I'm glad to see that you guys are already considering configuration settings and at the same time keeping it simple!

I like that. Please create a task for me when you get time to bring a configuration library into the project.
Here is the one I'm thinking of using: https://www.dynaconf.com
cc: @dieko95


# start scrapping
self.process.crawl(crawler)
self.process.start()
Member

This is not a blocking call, is it?

If it is not, I'm not sure about this API. The doc says that you return the scraped data, but start on the crawler process is intended for an async pattern and you need to use the join method. But maybe I'm wrong.
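For context, a self-contained sketch of how CrawlerProcess is commonly used: start() blocks until crawling finishes, and scraped items can be gathered through the item_scraped signal. The spider below is a stand-in, not this PR's code:

```python
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


results = []


def collect(item, response, spider):
    # called for every scraped item; copy it into our results list
    results.append(dict(item))


process = CrawlerProcess({"LOG_ENABLED": False})
crawler = process.create_crawler(ExampleSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks here until the crawl has finished
print(results)
```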

categories = response.css(".tdb-entry-category").getall()
categories = list(map(utils.strip_http_tags, categories))

return {
Member

This is related to my previous comment about using typed objects for outputs. Or more precisely, a hierarchy of DataClasses.

body = filter(lambda p: p.startswith("<p>") and p.endswith("</p>"), body)
body = map(utils.strip_http_tags, body)

body = "\n".join(body)
Member

Logic like this is well suited to living in the data class as a computed property. That will allow you to reuse this little bit of code easily. I know it is small now, but once you are scraping 100 sources, one line of repeated code becomes 100 lines of scattered repeated code, and if you need to change it, it will be a pain.
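A hedged sketch of that computed-property idea, assuming the output data class keeps the already-cleaned paragraphs (the class and field names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArticleBody:
    paragraphs: Tuple[str, ...]  # paragraphs already stripped of HTML tags

    @property
    def text(self) -> str:
        # join once here instead of repeating this line in every spider
        return "\n".join(self.paragraphs)
```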



def get_body_for_parse_ok():
return "Ganaderos de la zona del Sur del Lago de Maracaibo, área estratégica potencial para el desarrollo agropecuario de la nación, reportaron una disminución en el sacrificio de bovinos debido a los incesantes racionamientos eléctricos.\nAsí lo informó Daniel Ariza, presidente de la Asociación de Ganaderos y Agricultores del municipio Colón (Aganaco), quien aseguró que la medida se tomó en los frigoríficos: Frigorifico Industrial Sur del Lago (Frisulca), Frigorífico Catatutmo (Fricasa) y Frigorífico Industrial Santa Bárbara (Fibasa), para evitar acumulación de reses en canal. “Solo se ejecuta matanza en algunas unidades de producción y los índices descienden con el paso de los días”.\nPrecisó que aun cuando algunos frigoríficos cuentan con plantas eléctricas, estos carecen de combustible, transporte y logística para mantenerlas encendidas todos los días, tal como lo ameritan los cárnicos en cadena de frío. Solo hasta el 8 de abril, Corpoelec diseñó un plan que ha divulgado a través de las redes sociales sin vocería del Gobierno y donde incluye a los 12 circuitos de la subregión en bloques de hasta seis horas diarias de racionamiento.\n“Muchos de los ganaderos dejan de matar animales a raíz de esta situación tan difícil; fue una manera de controlar la pérdida de carne de res, de cerdo y otros animales que acá se producen, manteniéndolos vivos y destinarlos al sacrificio solo cuando Corpoelec asome una posible estabilidad en los circuitos”.\nTambién en las carnicerías optaron por bajar la compra de cárnicos. Los alimentos en estos expendios tienden a perderse con más facilidad, porque esta cadena de comercialización no ha invertido en un plan alternativo para mantener la vida útil de las cavas o cuartos fríos.\nPara el primer apagón, los carniceros se valieron de amigos y productores que resguardaron parte de la carne. Otra la vendieron a bajo costo y una minoría regaló la disponibilidad en las exhibidoras.\nMientras esto ocurre, la carne de res en las expendedoras locales disminuyó y con ello el consumo de proteína animal en los hogares venezolanos. Pero no solo se trata de la carne: la medida incluye los huesos, vísceras o las piezas duras como piel y cuernos del animal.\nLa producción de leche también va en retroceso y descendió desde marzo. Armando Chacín, presidente de la Federación Nacional de Ganaderos de Venezuela (Fedenaga), indicó a\xa0El Pitazo\xa0que durante los primeros cinco días del primer apagón, registrado por un incidente en Guri el 7 de marzo reciente, la pérdida de producción diaria nacional de leche cruda se ubicó en un millón 150.000 litros por día. Sin embargo, luego de tomarse “medidas urgentes”, el gremio logró reducir la cifra.\nUn litro de leche a puerta de planta es cancelado en mil bolívares. Según la federación,\xa0la tercera parte de la producción de leche es destinada a derivados como chicha, choco y demás subproductos que no ameritan pasteurización. Manifestó que quesos y demás preparaciones dependen del lácteo que de igual manera ameritan los neonatos en fase de crecimiento.\nDijo que los 120.000 productores que arriman el lácteo a las distintas manufactureras, procesadoras, pasteurizadoras y queseras debieron paralizar una parte del ordeño en sus unidades de producción.\n“Se originó una situación imprevista. 
Buena parte del rebaño nacional dejó de ser ordeñado y las vacas fueron destinadas a amamantar becerros.\nNuestra preocupación es que estos animales se van a secar porque les están dando la leche a sus crías y esto amerita que el ordeño se vuelva a realizar hasta un nuevo parto, en un lapso de nueve meses”, estimó Chacín.\nEsta organización resumió las consecuencias de estar tantas horas sin luz en el sector agropecuario de la región en la pérdida de 100.000 litros de leche. Aseguran que el sector ha sido uno de los más golpeados por la falta de electricidad.\nNo se habían recuperado de las consecuencias del primer apagón cuando el segundo interrumpió la cadena de producción. Sin fluido eléctrico es imposible enfriar y mantener en buen estado la leche; las pocas empresas que poseen plantas no operan sin combustible, elemento muy escaso en el municipio Machiques, en donde se dificulta cada día el suministro de gasolina y gasoil. Como consecuencia de esta escasez, también se dificulta el transporte de leche, y por ende la elaboración del queso y su comercialización."
Member

let's move all this big string to resource files.
You could have a CSV which can be loaded as a fixture in conftest, or a JSON, or anything that is simple enough to maintain and helps you remove clutter from the code.
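A possible conftest.py fixture along those lines (the file name and folder are assumptions):

```python
import pathlib

import pytest

RESOURCES = pathlib.Path(__file__).parent / "resources"


@pytest.fixture
def expected_el_pitazo_body() -> str:
    # load the long expected body from a resource file instead of inlining it
    return (RESOURCES / "el_pitazo_expected_body.txt").read_text(encoding="utf-8")
```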

"""
Check that ElPitazoScraper parses a valid page as expected
"""
url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html"
Member

let's try to put anything that is a resource inside a resources folder. That is more pythonic. Or perform a quick search for recommended practices for resources in pytest. I think moving the html_bodies folder inside a resources folder should be enough.



def get_body_for_parse_ok():
    return "Ganaderos de la zona del Sur del Lago de Maracaibo, [...]"  # same long string as quoted above
Member

same comment here, let's use a resource asset for these long strings

"""
Get article body as a single string
"""
body = response.css("#bsf_rt_marker > p").getall()
Collaborator

You can add ::text to let the query know you only want the text within the selected elements. By doing this you can probably omit the following two lines of code, letting Scrapy strip any HTML tags that may exist within the retrieved body, and it more or less guarantees that we get Unicode strings (which we are currently not getting).
body = response.css("#bsf_rt_marker > p ::text").getall()

This might be done in a different PR, maybe as a bonus on #73.

tags = self._get_tags(response)

# categories
categories = response.css(".tdb-entry-category")
Collaborator

nit: this line can simply be
categories = response.css(".tdb-entry-category ::text").getall()
This returns a list with the text found in each element of class "tdb-entry-category".

"""
Try to get tags from document if available
"""
tags = response.css(".tdb-tags > li > a")
Collaborator

The same comment as for categories applies here.

@LDiazN LDiazN marked this pull request as ready for review May 17, 2021 19:12
@LDiazN LDiazN merged commit 02e7935 into master May 17, 2021
@LDiazN LDiazN deleted the luis/toy-webscrapper branch May 17, 2021 19:12