
Luis/toy webscrapper #68

Merged: 20 commits merged into master on May 17, 2021

Conversation

@LDiazN (Collaborator) commented Apr 19, 2021

Important Changes

  • Added scrapy as a dependency
  • Added beautifulsoup4 as a dependency
  • Added a scraper module in c4v-py/src

Problem

Create a toy scraper that gets basic data from El Pitazo and can be expanded later with implementations for other sources.

Proposed solution

  • Expose a public scrape(url: str) -> ScrapedData function that receives a URL and scrapes data from it if possible, raising an error otherwise.
  • Create a spiders.py file with implementations of scrapers for different sources (there's currently just one, for El Pitazo).
  • Create a settings.py file with mappings from domains to spiders, so we can choose which implementation handles each domain, deferring domain-level logic to that spider (see the sketch below).
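A rough sketch of how these pieces could fit together, based on the names that appear in this PR (URL_TO_SCRAPER, scrape, ElPitazoScraper); the spider body below is a stand-in, not the real implementation:

```python
class ElPitazoScraper:
    """Stand-in for the El Pitazo spider implemented in spiders.py."""

    def scrape(self, url: str) -> dict:
        # the real spider crawls the page and returns fields such as
        # title, body, tags, and categories
        return {"url": url}


# settings.py-style mapping from known sources to the scraper that handles them
URL_TO_SCRAPER = {"elpitazo.net": ElPitazoScraper}


def scrape(url: str) -> dict:
    """Scrape `url` with the scraper registered for its source,
    raising an error when no known source matches."""
    for known_url, scraper_cls in URL_TO_SCRAPER.items():
        if known_url in url:
            return scraper_cls().scrape(url)
    raise ValueError(f"No scraper available for url: {url}")
```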

Tasks

  • Create a scraper spider for El Pitazo, working only for article URLs

@marianelamin (Collaborator)

no need to push .lock file, since it can be generated with poetry install or update.


_process = CrawlerProcess(CRAWLER_SETTINGS)

def scrape(url : str) -> List[dict]:
Collaborator

Could you create unit tests for the scrape function?

Collaborator

From the design docs, we are aiming for: given a URL, return the scraped data as a str. A Dict would make more sense so we are able to scale, but I don't see why we would have a list of dictionaries from just one URL.

Collaborator Author

I want to, I just don't know how yet, since I need to find a way to test a web scraper while decoupling the tests from the HTTP request to the page being scraped. I have a question: where should I put the tests once they've been written? With the nox tests, or in their own file/folder within the scraper module?

Collaborator

We have been adding our tests (not necessarily unit tests) under the tests/data folder, since our existing modules are in src/data, mirroring the src folder structure.

Collaborator

Put the tests in their own folder for now: since the scraper is in /src/bla-bla-scraper, put the tests in a tests folder at the top level (e.g. /tests/bla-bla-specs). This way, our tests use our library instead of being part of the library.

My two cents regarding testing. You do not test the scraper (Scrapy already does that), and you do not test that the webpage works—because websites are fragile and monitoring that scrapers work and alerting somebody is not part of the scope of this phase.

You should test the expected behavior of all the code that you wrote. By thinking hard about testing the implemented code, you decouple external modules from the core ones. For example, by decoupling the HTTP adapter and the Scraping Engine from our scraping code itself. Remember that great testability is a symptom of a great design.

That way, we can mock parts of the Scrapy framework, HTTP responses, or any outsider code. Only if it's needed that is :)

However, this is hard to do! Try not to get frustrated if you are struggling. It's normal.
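For what it's worth, one common way to decouple tests from live HTTP requests in Scrapy is to build a response from a saved HTML file; a sketch of such a helper (the helper actually used in this PR may differ in details):

```python
from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_path: str, url: str) -> HtmlResponse:
    """Build an offline Scrapy response from a saved HTML page so that a
    spider's parse() can be exercised without hitting the real website."""
    with open(file_path, "rb") as f:
        body = f.read()
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")
```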

Collaborator

+1 for the tests, I would also suggest that instead of writing the tests in src/scraper/tests, you write them in /tests

@asciidiego (Collaborator)

> no need to push .lock file, since it can be generated with poetry install or update.

I think we should (purposely) use lock files. After all, they represent the latest successful build of dependencies of the project. The lock file is safer since it is more specific than just a version number and, therefore, we should always commit it.

@asciidiego (Collaborator)

Just a reminder: whenever the PR is ready for review, remove its draft status, and (I think) it will appear on my GitHub notification dashboard.

@VKorelsky (Collaborator)

> no need to push .lock file, since it can be generated with poetry install or update.

> I think we should (purposely) use lock files. After all, they represent the latest successful build of dependencies of the project. The lock file is safer since it is more specific than just a version number and, therefore, we should always commit it.

Agree with this, lock files are usually committed :)

scraper = ElPitazoScraper()
parse_output = scraper.parse(response)

# body = ????? TODO still don't know how to test body equality
Collaborator

I think that for the body test, you'll have to copy a super long string and use that. Different ways to make the test a bit nicer to read in this case are:

  • define a method in the same test file that returns the expected body as a string, so that you can just write:
    assert parse_output["body"] == get_expected_body()
    ...
    def get_expected_body():
        return "super long string that you expect to be scraped from the website....."
  • define a method as above, but put the expected body in a separate text file and read from it:
    assert parse_output["body"] == get_expected_body()
    ...
    def get_expected_body():
        # <code that opens a text file and reads it as a string>

I would suggest the first approach.
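If the file-based option is ever preferred, a minimal sketch of that helper could look like this (the resource path and filename are hypothetical):

```python
from pathlib import Path

# hypothetical resource path; the file would hold the expected article body
EXPECTED_BODY_FILE = Path(__file__).parent / "resources" / "expected_body.txt"


def get_expected_body() -> str:
    # read the expected article body stored next to the tests
    return EXPECTED_BODY_FILE.read_text(encoding="utf-8")
```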

def test_title_parse_ok():
    url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html"
    test_file = os.path.join(ROOT_DIR, url)
    response = fake_response_from_file(test_file, "https://elpitazo.net/cronicas/la-fallas-electricas-han-disminuido-la-cantidad-de-carne-que-consume-el-venezolano/")
Collaborator

nice 👍🏽

Collaborator

As I said to Luis, I would not even test that the website is correctly parsed. Why? Website layouts are fragile, and fragile tests are bad tests.

There needs to be a separation of concern: one thing is parsing correctly, and another one is using the parser correctly. We need to test the latter because that's the purpose of a spider—among other things. Creating these kinds of tests for one spider is simple, but as websites are added and they change, it will be a hell to maintain.

Instead, we should (eventually) prepare the system for monitoring and alerts, but I would not worry about that—for now.

Collaborator

> Why? Website layouts are fragile, and fragile tests are bad tests.

In my opinion, if the code is expected to go read a webpage and extract information from it, there should be at least a test asserting that if the website looks one way, then X information is extracted from it. I'm not sure how else you would test that your scraper works correctly on one version of the intended website. When the website changes layout, we can update the test HTML we've saved for that version.

If way more websites are added and this becomes complex, then we can refactor our approach.

What other approaches do you suggest for testing this?

@VKorelsky (Collaborator) Apr 22, 2021

> and another one is using the parser correctly. We need to test the latter because that's the purpose of a spider—among other things.

Not sure I get what you mean by using the parser correctly. How do you suggest this is tested?

@asciidiego (Collaborator) Apr 22, 2021

Avoiding testing fragile code.

Just like you would (usually) not test a UI. Integration and E2E tests are expensive and a hell to maintain because they are fragile. I would apply the Humble Object pattern. Make the parser humble: its only job is to execute a parsing function (that has already been tested, e.g. beautifulsoup, lxml, ...). We make sure that the parser works (pass simple HTML strings with different selectors) and then move on.

The interesting logic we want to check is not whether something has been correctly configured or not. We want to assert that our program can configure itself based on the inputs (now that is easy to test). For example, URL starts with pitazo? Pitazo scraper it is. The Pitazo scraper does not return anything because the selectors are wrong? Well, that's the problem of the configuration, not the class. The class works. Its behavior is correct. No need for file fixtures. No need to update both code and tests every time a single div changes or when a CSS/XPath selector is not general enough.

When the website changes its layout (how are you going to know that it changed?), our tests must not break. An alert system should. We should not couple a whole Python library to a website CSS and HTML. Let alone multiple.
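For illustration, here is a self-contained sketch of the humble-parser idea being described; the class, CSS selector, and HTML string are made up for the example, not taken from this PR:

```python
from bs4 import BeautifulSoup


class HumbleParser:
    """Its only job is to run an already-tested parsing function."""

    def extract_title(self, html: str) -> str:
        tag = BeautifulSoup(html, "html.parser").select_one("h1.title")
        return tag.get_text(strip=True) if tag else ""


def test_humble_parser_extracts_title():
    html = "<html><body><h1 class='title'>Hola</h1></body></html>"
    assert HumbleParser().extract_title(html) == "Hola"
```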

@asciidiego (Collaborator) Apr 22, 2021

> In my opinion, if the code is expected to go read a webpage and extract information from it [...]

The code is expected to configure a spider appropriately based on the URL. It might be expected to clean the output a bit assuming a correct parsing. It is expected to do something when no input was found (because of connection errors or simple bad CSS selectors or layout changes). For example, just write an empty string to a Dict.

But, I do not expect the code to extract information from a webpage with 100% accuracy just like I do not expect an NLP algorithm to classify text perfectly. That's why we should validate our scraping engine with a monitoring system just like a Machine Learning Pipeline is constantly being evaluated and re-trained. We can only test deterministic algorithms. Scraping something that is not under our control is not one of those.

Caveat: the main reason why I would not test fragile things (configuration and layout) is the hell that it is to maintain, even for a single spider. However, there have been attempts to fix this issue, even by Scrapy maintainers themselves. But, in my opinion, it's an uphill battle. There are alternatives, but it looks like the cure is worse than the disease [1][2].

[1] https://github.com/scrapinghub/scrapy-autounit#caveats
[2] https://docs.scrapy.org/en/latest/topics/contracts.html

@VKorelsky (Collaborator) Apr 28, 2021

> When the website changes its layout (how are you going to know that it changed?), our tests must not break.

Our tests won't break, since that one sample page is stored locally as an HTML file in the project.
In the case of a mostly static website, where most pages have the same layout (chatting with Luis, it sounds like El Pitazo is mostly in this category), I think the value of the integration test will outweigh its maintenance burden (which should be very small).

If that website changes (and all pages adopt a new layout), then we'll have to update our parser anyway. (I'm with you on the fact that an alerting system should warn us of this, not our tests.) Once we're aware the site changed, we'd take a snapshot of what the new layout (hopefully still static) looks like for an article page, and update our test with it.

Of course, in the case of a more complicated website, or one where the setup is more complex, the maintenance overhead increases a lot and the overall value therefore decreases. In that case I agree with you that we shouldn't aim to write such tests.

Now, there may be some Scrapy specifics that make this impractical; about those I'm all ears :)

My overall impression is simply that for a website whose pages are static and follow the same layout, there is more value than cost in this kind of snapshot-based integration test.

from typing import List


class SpiderManager:
Collaborator

this class will need some unit tests too :)

Collaborator

I'll have a look at how to test this, given that it has the CrawlerProcess + spider things.
@m4r4c00ch0 maybe you would have some input here?

Collaborator

What's exactly the purpose of this class? I don't like the word Manager because it looks like this module might become a God Object. But, I might be wrong. Is this a Mediator?

p.s. I am used to writing tests before programming, so this might need refactoring before we can test it easily. I would give you my advice on how to unit test this class, but first I need to know its purpose.

Collaborator Author

Its purpose is to implement the scraping and parsing functions for spiders, so they are not directly managed by a BaseScraper subclass. This way, BaseScrapyScraper can just delegate that behavior to the manager object. Maybe it would be better to merge that logic into BaseScrapyScraper, since the latter is just redirecting behavior to SpiderManager?
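For context, a simplified sketch of the delegation being described; the structure follows the classes in this PR, while the method bodies are assumptions:

```python
class SpiderManager:
    """Wraps a scrapy spider and runs its scraping/parsing logic."""

    def __init__(self, spider) -> None:
        self.spider = spider

    def parse(self, response) -> dict:
        # run the wrapped spider's parse logic and return its output
        return self.spider.parse(response)


class BaseScrapyScraper:
    """Scraper backed by a scrapy spider; it only forwards to the manager."""

    def __init__(self, spider) -> None:
        self._spider_manager = SpiderManager(spider)

    def parse(self, response) -> dict:
        return self._spider_manager.parse(response)
```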

from typing import List


class BaseScraper:
Collaborator

👍🏽

Comment on lines 70 to 72
for known_url in URL_TO_SCRAPER.keys():
    if known_url in url:
        return URL_TO_SCRAPER[known_url]
Collaborator

here for efficiency I would suggest (see the sketch after this list):

  • validating that the URL is valid
  • extracting the domain name from the URL
  • using the domain name as the lookup key in the map
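A minimal sketch of that lookup using only the standard library (the helper name is made up):

```python
from urllib.parse import urlparse


def get_domain(url: str) -> str:
    """Minimally validate the URL and return its domain for the map lookup."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")
    return parsed.netloc.lower()


# usage: scraper_cls = URL_TO_SCRAPER[get_domain(url)]
```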

Collaborator Author

In order to validate the URL, the two easiest options I found are:

  • Using a regex like this from django
  • Using a validators library like this

Which approach do you think is best? I would prefer the first one since it does not require a new dependency.

from .settings import URL_TO_SCRAPER


def scrape(url: str) -> dict:
Collaborator

these methods will need some unit tests as well :)

from bs4 import BeautifulSoup


def clean(element: str) -> str:
Collaborator

Suggested change
def clean(element: str) -> str:
def strip_html_tags(element: str) -> str:

better to go with a name that matches exactly what this does, clean seems too broad in my opinion.
What do you think?

Collaborator Author

You're totally right, there's no good reason to leave it as just "clean".

Collaborator

advice: after you write tests for this function/module, you'll (probably) have a better guess of what the best name would be.

@@ -0,0 +1,42 @@
"""
Misc. helper functions.
Collaborator

also, tests for this :)


def __init__(self, spider) -> None:
    self.spider = spider
    pass
Collaborator

why is there a pass here?

Collaborator Author

Oh, it shouldn't be there, I just forgot to delete it. Thanks for noticing.


        self._spider_manager = SpiderManager(self.spider)

    def parse(self, response) -> dict:
        # Delegate parse function to scrapy implementation
Collaborator

I would remove these redundant comments. I can already see that the function is delegating its scrape implementation to the manager.

…d function to get domain from url; optimized map from url to scraper implementation
@Edilmo (Member) left a comment

LGTM

Nice work.
Whenever @VKorelsky and/or @m4r4c00ch0 think it is OK, go ahead.

My comments could be addressed in separate tasks/PRs, up to you.

@@ -0,0 +1,80 @@
"""
Member

The whole package is misplaced.
By convention, we would like everything inside our c4v-py project to be imported like this:

import c4v.scraper.X
...
from c4v.scraper import X

whereas with the current layout you would have:

import scraper.X
...
from scraper import X

The former is the pythonic way to organize things; the latter would be a library with more than one global package, which is extremely rare (at least for me).

I recommend sticking to the normal case.


# Python imports
from typing import List, Type

Member

I know that this is the first step, and because of that it is OK for me if the following observation is handled in a follow-up task/PR.

Currently we are returning a Dict, but I encourage you to start working with explicitly typed data structures as soon as possible.
More precisely, I would like to see a hierarchy of DataClasses representing outputs. This hierarchy should, to some extent, be aligned with the hierarchy of classes you already put in place in this PR.
And, if I can be more strict and picky, I would like to see the current hierarchy of classes defined as Generics parameterized by the output DataClasses.
Another important observation is that those DataClasses should be seen as immutable objects. They could have methods performing computations, but they should avoid mutating the object itself. That design should make everything friendlier to task pipelining and future parallelization.
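One possible shape for such an immutable output, sketched as a frozen dataclass; the field list is a guess based on what the El Pitazo spider currently returns (title, body, tags, categories):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ScrapedData:
    title: str
    body: str
    tags: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
```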

Base class for scrapers implementations
"""

def parse(self, response) -> dict:
Member

NIT: I noticed that you are following the typing, which is great.
If we want to be strict, the typing here has two mistakes:

  1. response has no type: maybe that is because Scrapy does not have type hints. Is that right?
  2. dict is not a valid type hint. The typing module defines a type called Dict. That one is generic, and a typical annotation example looks like this:
def parse(self, response: MyType) -> Dict[str, Any]:

Collaborator Author

About 1:
No, Scrapy does have type hints; we decided to keep the response as an "any" type so every scraper could have its own response type according to its needs. But I agree that we should have a base class providing a minimal interface for every response object, and (using Scrapy as an example) a ScrapyResponse implementation for Scrapy's Response object could just delegate the interface implementation to Scrapy, for example:

class BaseResponse:
    def url(self) -> str:  # just an example
        pass

class ScrapyResponse(BaseResponse):
    def __init__(self, scrapy_response: scrapy.http.Response):
        self.scrapy_response = scrapy_response

    def url(self) -> str:
        return self.scrapy_response.url

I'll be addressing this issue in another PR.
About 2:
You're right, I should change every dict signature. I wasn't sure which one to use when I first wrote it. I'll change it ASAP.
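For reference, the stricter annotation could look something like this, keeping response as Any until a base response type exists:

```python
from typing import Any, Dict


class BaseScraper:
    def parse(self, response: Any) -> Dict[str, Any]:
        raise NotImplementedError
```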

"""

# Settings passed to the crawler
CRAWLER_SETTINGS = {"LOG_ENABLED": False}
Member

I'm glad to see that you guys are already considering configuration settings and at the same time keeping it simple!

I like that. Please create a task for me when you get time to bring a configuration library into the project.
Here is the one I'm thinking of using: https://www.dynaconf.com
cc: @dieko95


# start scrapping
self.process.crawl(crawler)
self.process.start()
Member

This is not a blocking call, is it?

If it is not, I'm not sure about this API. The doc says that you return the scraped data, but start on the crawler process is intended for an async pattern and you need to use the join method. But maybe I'm wrong.
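For context, a self-contained sketch of how CrawlerProcess is commonly used: start() blocks until crawling finishes, and scraped items can be gathered through the item_scraped signal. The spider below is a stand-in, not this PR's code:

```python
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}


results = []


def collect(item, response, spider):
    # called for every scraped item; copy it into our results list
    results.append(dict(item))


process = CrawlerProcess({"LOG_ENABLED": False})
crawler = process.create_crawler(ExampleSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks here until the crawl has finished
print(results)
```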

categories = response.css(".tdb-entry-category").getall()
categories = list(map(utils.strip_http_tags, categories))

return {
Member

This is related to my previous comment about using typed objects for outputs. Or more precisely, a hierarchy of DataClasses.

body = filter(lambda p: p.startswith("<p>") and p.endswith("</p>"), body)
body = map(utils.strip_http_tags, body)

body = "\n".join(body)
Member

Logic like this is well suited to living in the data class as a computed property. That will allow you to reuse this little bit of code easily. I know it is small now, but once you are scraping 100 sources, one line of repeated code becomes 100 lines of scattered repeated code, and if you need to change it, it will be a pain.
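A hedged sketch of that computed-property idea, assuming the output data class keeps the already-cleaned paragraphs (the class and field names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArticleBody:
    paragraphs: Tuple[str, ...]  # paragraphs already stripped of HTML tags

    @property
    def text(self) -> str:
        # join once here instead of repeating this line in every spider
        return "\n".join(self.paragraphs)
```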



def get_body_for_parse_ok():
return "Ganaderos de la zona del Sur del Lago de Maracaibo, área estratégica potencial para el desarrollo agropecuario de la nación, reportaron una disminución en el sacrificio de bovinos debido a los incesantes racionamientos eléctricos.\nAsí lo informó Daniel Ariza, presidente de la Asociación de Ganaderos y Agricultores del municipio Colón (Aganaco), quien aseguró que la medida se tomó en los frigoríficos: Frigorifico Industrial Sur del Lago (Frisulca), Frigorífico Catatutmo (Fricasa) y Frigorífico Industrial Santa Bárbara (Fibasa), para evitar acumulación de reses en canal. “Solo se ejecuta matanza en algunas unidades de producción y los índices descienden con el paso de los días”.\nPrecisó que aun cuando algunos frigoríficos cuentan con plantas eléctricas, estos carecen de combustible, transporte y logística para mantenerlas encendidas todos los días, tal como lo ameritan los cárnicos en cadena de frío. Solo hasta el 8 de abril, Corpoelec diseñó un plan que ha divulgado a través de las redes sociales sin vocería del Gobierno y donde incluye a los 12 circuitos de la subregión en bloques de hasta seis horas diarias de racionamiento.\n“Muchos de los ganaderos dejan de matar animales a raíz de esta situación tan difícil; fue una manera de controlar la pérdida de carne de res, de cerdo y otros animales que acá se producen, manteniéndolos vivos y destinarlos al sacrificio solo cuando Corpoelec asome una posible estabilidad en los circuitos”.\nTambién en las carnicerías optaron por bajar la compra de cárnicos. Los alimentos en estos expendios tienden a perderse con más facilidad, porque esta cadena de comercialización no ha invertido en un plan alternativo para mantener la vida útil de las cavas o cuartos fríos.\nPara el primer apagón, los carniceros se valieron de amigos y productores que resguardaron parte de la carne. Otra la vendieron a bajo costo y una minoría regaló la disponibilidad en las exhibidoras.\nMientras esto ocurre, la carne de res en las expendedoras locales disminuyó y con ello el consumo de proteína animal en los hogares venezolanos. Pero no solo se trata de la carne: la medida incluye los huesos, vísceras o las piezas duras como piel y cuernos del animal.\nLa producción de leche también va en retroceso y descendió desde marzo. Armando Chacín, presidente de la Federación Nacional de Ganaderos de Venezuela (Fedenaga), indicó a\xa0El Pitazo\xa0que durante los primeros cinco días del primer apagón, registrado por un incidente en Guri el 7 de marzo reciente, la pérdida de producción diaria nacional de leche cruda se ubicó en un millón 150.000 litros por día. Sin embargo, luego de tomarse “medidas urgentes”, el gremio logró reducir la cifra.\nUn litro de leche a puerta de planta es cancelado en mil bolívares. Según la federación,\xa0la tercera parte de la producción de leche es destinada a derivados como chicha, choco y demás subproductos que no ameritan pasteurización. Manifestó que quesos y demás preparaciones dependen del lácteo que de igual manera ameritan los neonatos en fase de crecimiento.\nDijo que los 120.000 productores que arriman el lácteo a las distintas manufactureras, procesadoras, pasteurizadoras y queseras debieron paralizar una parte del ordeño en sus unidades de producción.\n“Se originó una situación imprevista. 
Buena parte del rebaño nacional dejó de ser ordeñado y las vacas fueron destinadas a amamantar becerros.\nNuestra preocupación es que estos animales se van a secar porque les están dando la leche a sus crías y esto amerita que el ordeño se vuelva a realizar hasta un nuevo parto, en un lapso de nueve meses”, estimó Chacín.\nEsta organización resumió las consecuencias de estar tantas horas sin luz en el sector agropecuario de la región en la pérdida de 100.000 litros de leche. Aseguran que el sector ha sido uno de los más golpeados por la falta de electricidad.\nNo se habían recuperado de las consecuencias del primer apagón cuando el segundo interrumpió la cadena de producción. Sin fluido eléctrico es imposible enfriar y mantener en buen estado la leche; las pocas empresas que poseen plantas no operan sin combustible, elemento muy escaso en el municipio Machiques, en donde se dificulta cada día el suministro de gasolina y gasoil. Como consecuencia de esta escasez, también se dificulta el transporte de leche, y por ende la elaboración del queso y su comercialización."
Member

let's move all this big string to resource files.
You could have a CSV which can be loaded as a fixture in conftest, or a JSON, or anything that is simple enough to maintain and helps you remove clutter from the code.
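A possible conftest.py fixture along those lines (the file name and folder are assumptions):

```python
import pathlib

import pytest

RESOURCES = pathlib.Path(__file__).parent / "resources"


@pytest.fixture
def expected_el_pitazo_body() -> str:
    # load the long expected body from a resource file instead of inlining it
    return (RESOURCES / "el_pitazo_expected_body.txt").read_text(encoding="utf-8")
```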

"""
Check that ElPitazoScraper parses a valid page as expected
"""
url = "tests/html_bodies/el_pitazo_fallas_electricas_carne.html"
Member

let's try to put anything that is a resource inside a resources folder. That is more pythonic. Or perform a quick search for recommended practices for resources in pytest. I think moving the html_bodies folder inside a resources folder should be enough.



def get_body_for_parse_ok():
    return "Ganaderos de la zona del Sur del Lago de Maracaibo, [...]"  # same long string as quoted above
Member

same comment here, let's use a resource asset for these long strings

"""
Get article body as a single string
"""
body = response.css("#bsf_rt_marker > p").getall()
Collaborator

You can add ::text to let the query know you only want the text within the selected elements. By doing this you can probably omit the following two lines of code, letting Scrapy strip any HTML tags that may exist within the retrieved body, and it more or less guarantees that we get Unicode strings (which we are currently not getting).
body = response.css("#bsf_rt_marker > p ::text").getall()

This might be done in a different PR, maybe as a bonus on #73.

tags = self._get_tags(response)

# categories
categories = response.css(".tdb-entry-category")
Collaborator

nit: this line can simply be
categories = response.css(".tdb-entry-category ::text").getall()
This returns a list with the text found in each element of class "tdb-entry-category".

"""
Try to get tags from document if available
"""
tags = response.css(".tdb-tags > li > a")
Collaborator

The same comment as for categories applies here.

@LDiazN LDiazN marked this pull request as ready for review May 17, 2021 19:12
@LDiazN LDiazN merged commit 02e7935 into master May 17, 2021
@LDiazN LDiazN deleted the luis/toy-webscrapper branch May 17, 2021 19:12