Luis/toy webscrapper #68
Conversation
No need to push the .lock file, since it can be generated with poetry install or update.
src/scraper/scraper.py
Outdated
_process = CrawlerProcess(CRAWLER_SETTINGS)

def scrape(url : str) -> List[dict]:
Could you create unit tests for the scrape function?
From the design docs, we are aiming for: given a URL, return the scraped data as a str. A Dict would make more sense so we are able to scale, but I don't see why we would have a list of dictionaries from just one URL.
I want to, I just don't know how yet, since I need to find a way to test a web scraper while decoupling the tests from the HTTP request to the page being scraped. I have a question: where should I put the tests once they've been written? With the nox tests? Or in their own file/folder within the scraper module?
We have been adding our tests (not necessarily unit tests) under the tests/data folder since our existing modules are in src/data. Like mirroring the src folder structure.
Put the tests in their own folder. For now, since the scraper is in /src/bla-bla-scraper, put the tests in a tests folder at the top level (e.g. /tests/bla-bla-specs). This way, our tests use our library instead of being part of the library.
My two cents regarding testing: you do not test the scraper (Scrapy already does that), and you do not test that the webpage works, because websites are fragile, and monitoring that scrapers work and alerting somebody is not part of the scope of this phase.
You should test the expected behavior of all the code that you wrote. By thinking hard about testing the implemented code, you decouple external modules from the core ones, for example by decoupling the HTTP adapter and the scraping engine from our scraping code itself. Remember that great testability is a symptom of a great design.
That way, we can mock parts of the Scrapy framework, HTTP responses, or any outside code. Only if it's needed, that is :)
However, this is hard to do! Try not to get frustrated if you are struggling. It's normal.
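For illustration, here is a minimal sketch of that kind of decoupling, using a fake Scrapy response built from an in-memory HTML string (the parse function and selector here are stand-ins, not the PR's actual code):
from scrapy.http import HtmlResponse

def make_fake_response(html: str, url: str) -> HtmlResponse:
    # Build a Scrapy response from a local HTML string,
    # so the test never performs an HTTP request.
    return HtmlResponse(url=url, body=html, encoding="utf-8")

def parse_title(response: HtmlResponse) -> str:
    # stand-in for the real parsing code under test
    return response.css("h1::text").get()

def test_parse_title_from_local_html():
    response = make_fake_response("<html><body><h1>Some title</h1></body></html>",
                                  "https://example.com/article")
    assert parse_title(response) == "Some title"
The same idea works with HTML saved to a file under the tests folder: read the file into a string and wrap it in an HtmlResponse.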
+1 for the tests. I would also suggest that instead of writing the tests in src/scraper/tests, you write them in /tests.
I think we should (purposely) use lock files. After all, they represent the latest successful build of dependencies of the project. The lock file is safer since it is more specific than just a version number and, therefore, we should always commit it.
Just a reminder: whenever the PR is ready for review, remove its draft status, and (I think) it will appear on my GitHub notification dashboard.
Agree with this, lock files are usually committed :)
…quality. Compare with mega string copied by hand from original page?
src/scraper/tests/el_pitazo_test.py
Outdated
scraper = ElPitazoScraper()
parse_output = scraper.parse(response)

# body = ????? TODO still don't know how to test body equality
I think that for the body test, you'll have to copy a super long string and use that.
Different ways to make the test a bit nicer to read in this case are:
- define a method in the same test file that returns the expected body as a string, that way you just say
assert parse_output['body'] == getExpectedBody()
...
def getExpectedBody():
    return "super long string that you expect to be scraped from the website....."
- define a method as above, but put the expected body in a different text file and read from it
assert parse_output['body'] == getExpectedBody()
...
def getExpectedBody():
    # <code that will go open a text file and read it as a string>
I would suggest the first approach.
src/scraper/tests/el_pitazo_test.py
Outdated
def test_title_parse_ok():
    url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html"
    test_file = os.path.join(ROOT_DIR, url)
    response = fake_response_from_file(test_file, "https://elpitazo.net/cronicas/la-fallas-electricas-han-disminuido-la-cantidad-de-carne-que-consume-el-venezolano/")
nice 👍🏽
As I said to Luis, I would not even test that the website is correctly parsed. Why? Website layouts are fragile, and fragile tests are bad tests.
There needs to be a separation of concerns: one thing is parsing correctly, and another is using the parser correctly. We need to test the latter because that's the purpose of a spider, among other things. Creating these kinds of tests for one spider is simple, but as websites are added and they change, it will be hell to maintain.
Instead, we should (eventually) prepare the system for monitoring and alerts, but I would not worry about that for now.
Why? Website layouts are fragile, and fragile tests are bad tests.
In my opinion, if the code is expected to go read a webpage and extract information from it, there should be at least a test asserting that if the website looks one way, then X information is extracted from it. I'm not sure how else you would test that your scraper is working correctly on one version of the intended website. When the website changes layout, we can change the test HTML we've saved for that version.
If way more websites are added and this becomes complex, then we can refactor our approach.
What other approaches do you suggest for testing this?
and another is using the parser correctly. We need to test the latter because that's the purpose of a spider, among other things.
Not sure I get what you mean by "using the parser correctly". How do you suggest this is tested?
By avoiding testing fragile code.
Just like you would (usually) not test a UI. Integration and E2E tests are expensive and hell to maintain because they are fragile. I would apply the Humble Object pattern. Make the parser humble: its only job is to execute a parsing function (that has already been tested, e.g. by beautifulsoup, lxml, ...). We make sure that the parser works (pass simple HTML strings with different selectors) and then move on.
The interesting logic we want to check is not whether something has been correctly configured or not. We want to assert that our program can configure itself based on the inputs (now that is easy to test). For example: URL starts with pitazo? Pitazo scraper it is. The Pitazo scraper does not return anything because the selectors are wrong? Well, that's a problem of the configuration, not the class. The class works. Its behavior is correct. No need for file fixtures. No need to update both code and tests every time a single div changes or when a CSS/XPath selector is not general enough.
When the website changes its layout (how are you going to know that it changed?), our tests must not break. An alert system should. We should not couple a whole Python library to a website's CSS and HTML. Let alone multiple.
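For illustration, a tiny sketch of the kind of configuration test meant here (all names are stand-ins, not the PR's actual API):
import pytest

class ElPitazoScraper:  # stand-in scraper class for the example
    pass

URL_TO_SCRAPER = {"elpitazo.net": ElPitazoScraper}  # stand-in mapping

def scraper_for(url: str):
    for domain, scraper in URL_TO_SCRAPER.items():
        if domain in url:
            return scraper
    raise ValueError(f"no scraper registered for {url}")

def test_pitazo_url_selects_pitazo_scraper():
    assert scraper_for("https://elpitazo.net/cronicas/some-article/") is ElPitazoScraper

def test_unknown_domain_raises():
    with pytest.raises(ValueError):
        scraper_for("https://unknown.example/article")
Note that these tests need no fixtures, no saved HTML, and no network.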
In my opinion, if the code is expected to go read a webpage and extract information from it [...]
The code is expected to configure a spider appropriately based on the URL. It might be expected to clean the output a bit, assuming correct parsing. It is expected to do something when no input is found (because of connection errors, simple bad CSS selectors, or layout changes), for example just writing an empty string to a Dict.
But I do not expect the code to extract information from a webpage with 100% accuracy, just like I do not expect an NLP algorithm to classify text perfectly. That's why we should validate our scraping engine with a monitoring system, just like a Machine Learning pipeline is constantly being evaluated and re-trained. We can only test deterministic algorithms. Scraping something that is not under our control is not one of those.
Caveat: the main reason why I would not test fragile things (configuration and layout) is because of how hellish they are to maintain, even for a single spider. However, there have been attempts to fix this issue, even by the Scrapy maintainers themselves. But, in my opinion, it's an uphill battle. There are alternatives, but it looks like the cure is worse than the disease [1][2].
[1] https://github.com/scrapinghub/scrapy-autounit#caveats
[2] https://docs.scrapy.org/en/latest/topics/contracts.html
When the website changes its layout (how are you going to know that it changed?), our tests must not break.
Our tests won't break, since that one sample page is stored locally as an HTML file in the project.
In the case of a mostly static website, where most pages have the same layout (chatting with Luis, it looks like El Pitazo is mostly in this category), I think the value of the integration test will outweigh its maintenance burden (which should be very little).
If that website changes (and all pages adopt a new layout), then we'll have to update our parser anyway. (I'm with you on the fact that an alerting system should warn us of this, not our tests.) Once we're aware the site changed, we'd take a snapshot of what the new layout (hopefully still as static) looks like for an article page, and update our test with it.
Of course, in the case of a more complicated website, or one where the setup is more complex, the overhead of maintenance increases a lot and the overall value therefore decreases. In that case I agree with you that we shouldn't aim to write such tests.
Now, there may be some Scrapy specifics that make this impractical; about those I'm all ears :)
My overall impression is simply that for a website whose pages are static and follow the same layout, there is more value than cost in this kind of snapshot-based integration test.
src/scraper/spider_manager.py
Outdated
from typing import List


class SpiderManager:
this class will need some unit tests too :)
I'll have a look at how to test this, given that it has the CrawlerProcess + spider things.
@m4r4c00ch0 maybe you would have some input here?
What exactly is the purpose of this class? I don't like the word Manager because it looks like this module might become a God Object. But I might be wrong. Is this a Mediator?
p.s. I am used to writing tests before programming, so this might need refactoring before we can test it easily. I would give you my advice on how to unit test this class, but first I need to know its purpose.
Its purpose is to implement scraping and parsing functions for spiders, so they are not directly managed by a BaseScraper subclass. This way, BaseScrapyScraper can just delegate that behavior to the manager object. Maybe it's better to merge that logic into BaseScrapyScraper, since the latter is just redirecting behavior to SpiderManager?
src/scraper/scrapers/base_scraper.py
Outdated
from typing import List


class BaseScraper:
👍🏽
src/scraper/scraper.py
Outdated
for known_url in URL_TO_SCRAPER.keys():
    if known_url in url:
        return URL_TO_SCRAPER[known_url]
Here, for efficiency, I would suggest (see the sketch below):
- validating that the URL is valid
- extracting the domain name from the URL
- using the domain name as the lookup key in the map
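A minimal sketch of what that could look like, assuming URL_TO_SCRAPER is keyed by bare domain names and ElPitazoScraper is the class defined in this PR:
from urllib.parse import urlparse

URL_TO_SCRAPER = {"elpitazo.net": ElPitazoScraper}  # assumed: map keyed by bare domain

def get_scraper_for(url: str):
    parsed = urlparse(url)
    if not (parsed.scheme and parsed.netloc):  # 1. validate the URL
        raise ValueError(f"invalid url: {url}")
    domain = parsed.netloc.lower()             # 2. extract the domain name
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return URL_TO_SCRAPER[domain]              # 3. use the domain as the lookup key
This keeps the lookup a direct dictionary access instead of scanning every known URL.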
src/scraper/scraper.py
Outdated
from .settings import URL_TO_SCRAPER


def scrape(url: str) -> dict:
these methods will need some unit tests as well :)
src/scraper/utils.py
Outdated
from bs4 import BeautifulSoup


def clean(element: str) -> str:
- def clean(element: str) -> str:
+ def strip_html_tags(element: str) -> str:
Better to go with a name that matches exactly what this does; clean seems too broad in my opinion. What do you think?
You're totally right, there's no good reason to leave it as just "clean".
advice: after you write tests for this function/module, you'll (probably) have a better guess of what the best name would be.
src/scraper/utils.py
Outdated
@@ -0,0 +1,42 @@
"""
Misc. helper functions.
also, tests for this :)
src/scraper/spider_manager.py
Outdated
def __init__(self, spider) -> None:
    self.spider = spider
    pass
why is there a pass here?
Oh, it shouldn't be there, I just forgot to delete it. Thanks for noticing.
self._spider_manager = SpiderManager(self.spider)

def parse(self, response) -> dict:
    # Delegate parse function to scrapy implementation
I would remove these redundant comments. I can already see that the function is delegating its implementation to the manager.
…d function to get domain from url; optimized map from url to scraper implementation
LGTM
Nice work.
Whenever @VKorelsky and/or @m4r4c00ch0 think it's ok, go ahead.
My comments could be addressed in separate tasks/PRs, up to you.
src/scraper/scraper.py
Outdated
@@ -0,0 +1,80 @@
"""
The whole package is misplaced.
By convention, we would like everything inside our c4v-py project to be imported like this:
import c4v.scraper.X
...
from c4v.scraper import X
but with this layout you would have:
import scraper.X
...
from scraper import X
The former is the pythonic way to organize things; the latter would be a library with more than one global package, which is extremely rare (at least for me).
I recommend sticking to the normal case.
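For reference, a sketch of a layout that would make those imports resolve, assuming the package is declared from src in pyproject.toml (file names are illustrative):
src/
  c4v/
    __init__.py
    scraper/
      __init__.py
      scraper.py
      settings.py
      spiders/
        el_pitazo.py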
src/scraper/scraper.py
Outdated
# Python imports
from typing import List, Type
I know that this is the first step, and because of that it's ok for me if the following observation is handled in a follow-up task/PR.
Currently we are returning a Dict, but I encourage you to start working with explicitly typed data structures as soon as possible.
More precisely, I would like to see a hierarchy of DataClasses representing outputs. And this hierarchy, up to some extent, should be aligned with the class hierarchy you already put in place in this PR.
And, if I can be more strict and picky, I would like to see the current hierarchy of classes defined as Generics which are parametrized by the output DataClasses.
Another important observation is that those DataClasses should be seen as immutable objects. They could have methods performing computations, but they should avoid mutating the object itself. That design should help you make everything more friendly for pipelining of tasks and future parallelization.
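To illustrate the direction, a minimal sketch of an immutable output DataClass (field names are guesses based on what the El Pitazo spider currently returns, not a final schema):
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)  # frozen=True makes instances immutable
class ScrapedData:
    url: str
    title: str = ""
    body: str = ""
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
A scraper hierarchy could then be parametrized over such classes, e.g. a Generic[T] base scraper whose parse returns T instead of a plain Dict.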
src/scraper/scrapers/base_scraper.py
Outdated
Base class for scrapers implementations
"""

def parse(self, response) -> dict:
NIT: I noticed that you are following the type hints, which is great.
If we would like to be strict, the typing here has two mistakes:
- response has no type: maybe that is because Scrapy does not have type hints. Is that right?
- dict is not a valid type hint. The typing module defines a type called Dict. That one is generic, and a typical annotation example looks like this:
def parse(self, response: MyType) -> Dict[str, Any]:
About 1:
No, it does have type hints; we decided to keep it as the "any" type so every scraper could have its own response type according to its needs. But I agree that we should have a base class in order to have a minimal interface for every response object. Using Scrapy as an example, a ScrapyResponse implementation for Scrapy's Response object could just delegate the interface implementation to Scrapy, for example:
class BaseResponse:
    def url(self) -> str:  # just an example
        pass

class ScrapyResponse(BaseResponse):
    def __init__(self, scrapy_response: scrapy.http.Response):
        self.scrapy_response = scrapy_response

    def url(self) -> str:
        return self.scrapy_response.url  # Response.url is an attribute, not a method
I'll be addressing this issue in some other PR.
About 2:
You're right, I should change every dict signature. I wasn't sure which one I should have used when I first wrote it. I'll change it ASAP.
src/scraper/scrapy_settings.py
Outdated
""" | ||
|
||
# Settings passed to the crawler | ||
CRAWLER_SETTINGS = {"LOG_ENABLED": False} |
I'm glad to see that you guys are already considering configuration settings and at the same time keeping it simple!
I like that. Please create a task for me when you get time, to bring a configuration library into the project.
Here is the one I'm thinking of using: https://www.dynaconf.com
cc: @dieko95
src/scraper/spider_manager.py
Outdated
# start scrapping
self.process.crawl(crawler)
self.process.start()
This is not a blocking call, is it?
If it isn't, I'm not sure about this API. The doc says that you return the scraped data, but start on the crawler process is intended as an async pattern and you need to use the join method. But maybe I'm wrong.
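As far as I can tell from the Scrapy docs, CrawlerProcess.start() runs the Twisted reactor and only returns once the crawl has finished, so one way for a synchronous scrape() to return data is to collect items through the item_scraped signal. A rough sketch, not the PR's actual code:
from scrapy import signals
from scrapy.crawler import CrawlerProcess

def run_spider_and_collect(spider_cls, url: str) -> list:
    items = []

    def collect(item, response, spider):
        # called for every item the spider yields
        items.append(item)

    process = CrawlerProcess({"LOG_ENABLED": False})
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collect, signal=signals.item_scraped)
    process.crawl(crawler, start_urls=[url])
    process.start()  # blocks until the crawl is done
    return items
One caveat: the Twisted reactor cannot be restarted in the same process, so calling this twice per run will fail; that may matter for the scrape() API.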
src/scraper/spiders/el_pitazo.py
Outdated
categories = response.css(".tdb-entry-category").getall()
categories = list(map(utils.strip_http_tags, categories))

return {
This is related to my previous comment about using typed objects for outputs. Or more precisely, a hierarchy of DataClasses.
src/scraper/spiders/el_pitazo.py
Outdated
body = filter(lambda p: p.startswith("<p>") and p.endswith("</p>"), body)
body = map(utils.strip_http_tags, body)

body = "\n".join(body)
Logic like this is a good candidate to live in the data class as a computed property. That will allow you to reuse this little bit of code easily. I know that it's a small amount of code, but once you are scraping 100 sources, one line of repeated code becomes 100 lines of scattered repeated code. And if you need to change it, it will be a pain.
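For example, if the output DataClass keeps the raw paragraphs, the joining can live there as a computed property (a sketch; names are illustrative):
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Article:
    paragraphs: List[str]

    @property
    def body(self) -> str:
        # one shared place for "join the paragraphs into a single body string"
        return "\n".join(self.paragraphs)
Every spider then reuses the same joining logic instead of repeating it.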
src/scraper/tests/el_pitazo_test.py
Outdated
def get_body_for_parse_ok():
return "Ganaderos de la zona del Sur del Lago de Maracaibo, área estratégica potencial para el desarrollo agropecuario de la nación, reportaron una disminución en el sacrificio de bovinos debido a los incesantes racionamientos eléctricos.\nAsí lo informó Daniel Ariza, presidente de la Asociación de Ganaderos y Agricultores del municipio Colón (Aganaco), quien aseguró que la medida se tomó en los frigoríficos: Frigorifico Industrial Sur del Lago (Frisulca), Frigorífico Catatutmo (Fricasa) y Frigorífico Industrial Santa Bárbara (Fibasa), para evitar acumulación de reses en canal. “Solo se ejecuta matanza en algunas unidades de producción y los índices descienden con el paso de los días”.\nPrecisó que aun cuando algunos frigoríficos cuentan con plantas eléctricas, estos carecen de combustible, transporte y logística para mantenerlas encendidas todos los días, tal como lo ameritan los cárnicos en cadena de frío. Solo hasta el 8 de abril, Corpoelec diseñó un plan que ha divulgado a través de las redes sociales sin vocería del Gobierno y donde incluye a los 12 circuitos de la subregión en bloques de hasta seis horas diarias de racionamiento.\n“Muchos de los ganaderos dejan de matar animales a raíz de esta situación tan difícil; fue una manera de controlar la pérdida de carne de res, de cerdo y otros animales que acá se producen, manteniéndolos vivos y destinarlos al sacrificio solo cuando Corpoelec asome una posible estabilidad en los circuitos”.\nTambién en las carnicerías optaron por bajar la compra de cárnicos. Los alimentos en estos expendios tienden a perderse con más facilidad, porque esta cadena de comercialización no ha invertido en un plan alternativo para mantener la vida útil de las cavas o cuartos fríos.\nPara el primer apagón, los carniceros se valieron de amigos y productores que resguardaron parte de la carne. Otra la vendieron a bajo costo y una minoría regaló la disponibilidad en las exhibidoras.\nMientras esto ocurre, la carne de res en las expendedoras locales disminuyó y con ello el consumo de proteína animal en los hogares venezolanos. Pero no solo se trata de la carne: la medida incluye los huesos, vísceras o las piezas duras como piel y cuernos del animal.\nLa producción de leche también va en retroceso y descendió desde marzo. Armando Chacín, presidente de la Federación Nacional de Ganaderos de Venezuela (Fedenaga), indicó a\xa0El Pitazo\xa0que durante los primeros cinco días del primer apagón, registrado por un incidente en Guri el 7 de marzo reciente, la pérdida de producción diaria nacional de leche cruda se ubicó en un millón 150.000 litros por día. Sin embargo, luego de tomarse “medidas urgentes”, el gremio logró reducir la cifra.\nUn litro de leche a puerta de planta es cancelado en mil bolívares. Según la federación,\xa0la tercera parte de la producción de leche es destinada a derivados como chicha, choco y demás subproductos que no ameritan pasteurización. Manifestó que quesos y demás preparaciones dependen del lácteo que de igual manera ameritan los neonatos en fase de crecimiento.\nDijo que los 120.000 productores que arriman el lácteo a las distintas manufactureras, procesadoras, pasteurizadoras y queseras debieron paralizar una parte del ordeño en sus unidades de producción.\n“Se originó una situación imprevista. 
Buena parte del rebaño nacional dejó de ser ordeñado y las vacas fueron destinadas a amamantar becerros.\nNuestra preocupación es que estos animales se van a secar porque les están dando la leche a sus crías y esto amerita que el ordeño se vuelva a realizar hasta un nuevo parto, en un lapso de nueve meses”, estimó Chacín.\nEsta organización resumió las consecuencias de estar tantas horas sin luz en el sector agropecuario de la región en la pérdida de 100.000 litros de leche. Aseguran que el sector ha sido uno de los más golpeados por la falta de electricidad.\nNo se habían recuperado de las consecuencias del primer apagón cuando el segundo interrumpió la cadena de producción. Sin fluido eléctrico es imposible enfriar y mantener en buen estado la leche; las pocas empresas que poseen plantas no operan sin combustible, elemento muy escaso en el municipio Machiques, en donde se dificulta cada día el suministro de gasolina y gasoil. Como consecuencia de esta escasez, también se dificulta el transporte de leche, y por ende la elaboración del queso y su comercialización." |
Let's move all these big strings to resource files.
You could have a CSV which can be loaded as a fixture in conftest. Or a JSON. Or anything that is simple enough to maintain and helps you remove clutter from the code.
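For example, a conftest.py fixture could load the expected body from a text file kept next to the tests (a sketch; the file and fixture names are just examples):
import pathlib
import pytest

RESOURCES = pathlib.Path(__file__).parent / "resources"

@pytest.fixture
def expected_el_pitazo_body() -> str:
    # keep the long expected text in a resource file instead of in the test code
    return (RESOURCES / "el_pitazo_expected_body.txt").read_text(encoding="utf-8")
The test then takes expected_el_pitazo_body as an argument and asserts the parsed body against it.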
src/scraper/tests/el_pitazo_test.py
Outdated
""" | ||
Check that ElPitazoScraper parses a valid page as expected | ||
""" | ||
url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html" |
Let's try to put anything that is a resource inside a resources folder; that is more pythonic. Or do a quick search for recommended practices for resources in pytest. I think moving the html_bodies folder inside a resources folder should be enough.
Same comment here: let's use a resource asset for these long strings.
""" | ||
Get article body as a single string | ||
""" | ||
body = response.css("#bsf_rt_marker > p").getall() |
You can add ::text to the query to let it know you only want the text within the selected elements. By doing this you can probably omit the following two lines of code, letting Scrapy strip any HTML tags that may exist within the retrieved body, and it somewhat guarantees we get unicode strings (which we are currently not getting).
body = response.css("#bsf_rt_marker > p ::text").getall()
This might be done in a different PR, maybe as a bonus on #73.
src/c4v/scraper/spiders/el_pitazo.py
Outdated
tags = self._get_tags(response)

# categories
categories = response.css(".tdb-entry-category")
Nit: this line can simply be
categories = response.css(".tdb-entry-category ::text").getall()
This returns a list with the text found on each element of class "tdb-entry-category".
src/c4v/scraper/spiders/el_pitazo.py
Outdated
""" | ||
Try to get tags from document if available | ||
""" | ||
tags = response.css(".tdb-tags > li > a") |
The same comment for categories applies here.
Important Changes
- Added scrapy as a dependency
- Added beautifulsoup4 as a dependency
- Added a scraper module in c4v-py/src
Problem
Create a toy scraper version to get basic data for El Pitazo, while also being able to expand it with more implementations for other sources.
Proposed solution
- A scrape(url : str) -> ScrapedData function that receives a url and scrapes data for that url if possible, raising an error otherwise.
- A spiders.py file with valid implementations of scrapers for different sources (there's currently just one, for El Pitazo).
- A settings.py file with mappings from domains to spiders, so we can choose which implementation works for each, deferring domain-level logic to that spider.
Tasks