Upgrade library
* Improve base library
* Upgrade dependencies
joskfg authored Mar 23, 2022
1 parent 1f105a3 commit ae3e629
Showing 7 changed files with 93 additions and 47 deletions.
5 changes: 1 addition & 4 deletions .github/workflows/php.yml → .github/workflows/build.yml
@@ -1,4 +1,4 @@
- name: PHP Composer
+ name: Build

on:
push:
@@ -30,6 +30,3 @@ jobs:

- name: Run test suite
run: composer run-script test
-
- - name: PHP_CodeSniffer Check with Annotations
- uses: chekalsky/[email protected]
19 changes: 19 additions & 0 deletions Makefile
@@ -0,0 +1,19 @@
.PHONY: tests
tests:
	docker-compose run composer run test

.PHONY: debug
debug:
	docker-compose run --entrypoint=bash composer

.PHONY: update-dependencies
update-dependencies:
	docker-compose run composer update

.PHONY: checkstyle
checkstyle:
	docker-compose run composer run checkstyle

.PHONY: fix-checkstyle
fix-checkstyle:
	docker-compose run composer run fix-checkstyle
29 changes: 13 additions & 16 deletions README.md
@@ -2,14 +2,11 @@

[![Latest Version](https://img.shields.io/github/release/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](https://github.com/joskfg/laravel-intelligent-scraper/releases)
[![Software License](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=flat-square)](LICENSE.md)
- [![Build Status](https://img.shields.io/travis/joskfg/laravel-intelligent-scraper/master.svg?style=flat-square)](https://travis-ci.org/joskfg/laravel-intelligent-scraper)
- [![Coverage Status](https://img.shields.io/scrutinizer/coverage/g/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](https://scrutinizer-ci.com/g/joskfg/laravel-intelligent-scraper/code-structure)
- [![Quality Score](https://img.shields.io/scrutinizer/g/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](https://scrutinizer-ci.com/g/joskfg/laravel-intelligent-scraper)
+ [![Build Status](https://github.com/joskfg/laravel-intelligent-scraper/actions/workflows/build.yml/badge.svg)](https://github.com/joskfg/laravel-intelligent-scraper/actions/workflows/build.yml)
[![Total Downloads](https://img.shields.io/packagist/dt/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](https://packagist.org/packages/joskfg/laravel-intelligent-scraper)
[![Average time to resolve an issue](http://isitmaintained.com/badge/resolution/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](http://isitmaintained.com/project/joskfg/laravel-intelligent-scraper "Average time to resolve an issue")
[![Percentage of issues still open](http://isitmaintained.com/badge/open/joskfg/laravel-intelligent-scraper.svg?style=flat-square)](http://isitmaintained.com/project/joskfg/laravel-intelligent-scraper "Percentage of issues still open")
-

This packages offers a scraping solution that doesn't require to know the web HTML structure and it is autoconfigured
when some change is detected in the HTML structure. This allows you to continue scraping without manual intervention
during a long time.
@@ -63,8 +60,8 @@ The default stack already has the http_errors middleware, so you only need to do

## Configuration

- There are two different options for the initial setup. The package can be
- [configured using datasets](#configuration-based-in-dataset) or
+ There are two different options for the initial setup. The package can be
+ [configured using datasets](#configuration-based-in-dataset) or
[configured using Xpath](#configuration-based-in-xpath). Both ways produce the same result but
depending on your Xpath knowledge you could prefer one or other. We recommend using the
[configured using Xpath](#configuration-based-in-xpath) approach.
@@ -103,18 +100,18 @@ trying to cover maximum page variations possible. The scraper WILL NOT BE ABLE t
in the dataset.

Once we did the job, all is ready to work. You should not care about updates always you have enough data in the dataset
- to cover all the new modifications on the page, so the scraper will recalculate the modifications on the fly. You can
+ to cover all the new modifications on the page, so the scraper will recalculate the modifications on the fly. You can
check [how it works](how-it-works.md) to know much about the internals.

We will check more deeply how we can create a new dataset and what options are available in the next section.

#### Dataset creation

- The dataset is composed by `url` and `data`.
+ The dataset is composed by `url` and `data`.
* The `url` part is simple, you just need to indicate the url from where you obtained the data.
* The `type` part gives a item name to the current dataset. This allows you to define multiple types.
* The `variant` identifies the page variant. The identifier is a sha1 hash build based in the xpath used to get the data.
- * The `data` part is where you indicate what data and assign the label that you want to get.
+ * The `data` part is where you indicate what data and assign the label that you want to get.
The data could be a list of items or a single item.

A basic example could be:
@@ -166,10 +163,10 @@ ScrapedDataset::create([
]);
```

- With this change we will ensure that we detect the `body` even if it has hidden characters.
+ With this change we will ensure that we detect the `body` even if it has hidden characters.

**IMPORTANT** The scraper tries to find the text in all the tags including children, so if you define a regular
- expression without limit, like for example `/.*Body starts.*/` you will find the text in `<html>` element due to that
+ expression without limit, like for example `/.*Body starts.*/` you will find the text in `<html>` element due to that
text is inside some child element of `<html>`. So define regexp carefully.
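
As a rough illustration of the warning above, the following standalone sketch shows why an unanchored expression also matches ancestor elements. It does not use the package: the HTML snippet and the `matchingTags` helper are hypothetical, and it assumes that matching against an element's `textContent` (which includes the text of its children) approximates what the scraper does.

```php
<?php

// Hypothetical page: only the <p> holds the text we want to label as `body`.
$html = '<html><body><div><p>Body starts here</p></div><span>Contact</span></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

/**
 * Illustrative helper (not part of the package): returns the tag names of all
 * elements whose text, children included, matches $pattern.
 */
function matchingTags(DOMDocument $dom, string $pattern): array
{
    $tags = [];
    foreach ($dom->getElementsByTagName('*') as $node) {
        if (preg_match($pattern, $node->textContent) === 1) {
            $tags[] = $node->nodeName;
        }
    }

    return $tags;
}

// Unanchored: every ancestor of the <p> also contains the text.
$loose = matchingTags($dom, '/.*Body starts.*/');

// Anchored: elements that carry extra text (here <html> and <body>) drop out.
$strict = matchingTags($dom, '/^Body starts here$/');

echo implode(', ', $loose), PHP_EOL;  // html, body, div, p
echo implode(', ', $strict), PHP_EOL; // div, p
```

Note that the anchored pattern still matches the wrapping `<div>`, because its only text is the same string. Anchoring narrows the candidates but does not by itself single out the innermost element.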

### Configuration based in Xpath
@@ -207,7 +204,7 @@ are not going to trigger the [reconfiguration process](#configure-scraper).

After configure the scraper, you will be able to request a specific scrape using the `scrape` helper
```php
- <?php
+ <?php

scrape('https://test.c/p/my-objective', 'Item-definition-1');
```
@@ -300,17 +297,17 @@ php artisan queue:work --queue=configure # Just one
To run the tests, run the following command from the project folder.

``` bash
- $ docker-compose run test
+ $ make tests
```

- To run interactively using [PsySH](http://psysh.org/):
+ To open a terminal in the dev environment:
``` bash
- $ docker-compose run psysh
+ $ make debug
```

## How it works?

- The scraper is auto configurable, but needs an initial dataset or add a configuration.
+ The scraper is auto configurable, but needs an initial dataset or add a configuration.
The dataset tells the configurator which data do you want and how to label it.

There are three services that have unique responsibilities and are connected using the event system.
35 changes: 24 additions & 11 deletions composer.json
@@ -9,23 +9,23 @@
"issues": "https://github.com/joskfg/laravel-intelligent-scraper/issues"
},
"require": {
- "php": ">=7.4",
+ "php": ">= 8.0",
"fabpot/goutte": "^3.2",
- "psr/log": "^2.0",
- "illuminate/database": "^5.8 || ^6.0 || ^7.0 || ^8.0",
- "illuminate/events": "^5.8 || ^6.0 || ^7.0 || ^8.0",
+ "psr/log": "^1.0 || ^2.0 || ^3.0",
+ "illuminate/database": "^7.0 || ^8.0",
+ "illuminate/events": "^7.0 || ^8.0",
"ext-dom": "*",
"ext-json": "*"
},
"require-dev": {
"phpunit/phpunit": "^9.0",
"mockery/mockery": "^1.0",
- "friendsofphp/php-cs-fixer": "^2.4",
+ "friendsofphp/php-cs-fixer": "^3.8",
"laravel/legacy-factories": "^1.1",
"squizlabs/php_codesniffer": "^3",
"orchestra/testbench": "^6.18",
"orchestra/database": "^6.0",
- "rector/rector": "^0.11.20"
+ "rector/rector": "^0.12.18"
},
"autoload": {
"files": [
@@ -45,11 +45,24 @@
}
},
"scripts": {
- "all-test": "phpunit --coverage-text; php-cs-fixer fix -v --diff --dry-run --allow-risky=yes;",
- "test": "phpunit --coverage-text --testsuite=Unit; php-cs-fixer fix -v --diff --dry-run --allow-risky=yes;",
- "phpunit": "phpunit --coverage-text",
- "phpcs": "php-cs-fixer fix -v --diff --dry-run --allow-risky=yes;",
- "fix-cs": "php-cs-fixer fix -v --diff --allow-risky=yes;"
+ "all-tests": [
+ "@checkstyle",
+ "phpunit --coverage-text"
+ ],
+ "test": [
+ "@checkstyle",
+ "phpunit --coverage-text --testsuite Unit"
+ ],
+ "checkstyle": [
+ "php-cs-fixer fix -v --diff --dry-run --allow-risky=yes",
+ "rector --dry-run"
+ ],
+ "fix-checkstyle": [
+ "@php-cs-fixer",
+ "@rector"
+ ],
+ "php-cs-fixer": "php-cs-fixer fix -v --diff --allow-risky=yes",
+ "rector": "rector"
},
"extra": {
"laravel": {
36 changes: 25 additions & 11 deletions phpunit.xml
@@ -1,15 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
- <phpunit xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" bootstrap="vendor/autoload.php" backupGlobals="false" backupStaticAttributes="false" colors="true" verbose="true" convertErrorsToExceptions="true" convertNoticesToExceptions="true" convertWarningsToExceptions="true" processIsolation="false" stopOnFailure="false" xsi:noNamespaceSchemaLocation="https://schema.phpunit.de/9.3/phpunit.xsd">
- <coverage>
- <include>
- <directory suffix=".php">src</directory>
- </include>
- <report>
- <clover outputFile="build/clover.xml"/>
- <html outputDirectory="build/coverage"/>
- <text outputFile="build/coverage.txt"/>
- </report>
- </coverage>
+ <phpunit bootstrap="vendor/autoload.php"
+ backupGlobals="false"
+ backupStaticAttributes="false"
+ colors="true"
+ verbose="true"
+ convertErrorsToExceptions="true"
+ convertNoticesToExceptions="true"
+ convertWarningsToExceptions="true"
+ processIsolation="false"
+ stopOnFailure="false">
+
<testsuites>
<testsuite name="Unit">
<directory>tests/Unit</directory>
@@ -18,10 +18,24 @@
<directory>tests/Integration</directory>
</testsuite>
</testsuites>
+
+ <coverage processUncoveredFiles="true">
+ <include>
+ <directory suffix=".php">src</directory>
+ </include>
+ <report>
+ <clover outputFile="build/clover.xml"/>
+ <html outputDirectory="build/coverage"/>
+ <text outputFile="build/coverage.txt"/>
+ </report>
+ </coverage>
+
<logging>
<junit outputFile="build/report.junit.xml"/>
</logging>
+
<php>
<env name="DB_CONNECTION" value="testing"/>
</php>
+
</phpunit>
8 changes: 3 additions & 5 deletions rector.php
@@ -11,15 +11,13 @@
// get parameters
$parameters = $containerConfigurator->parameters();

$parameters->set(Option::PATHS, [__DIR__ . '/src', __DIR__ . '/tests']);

// Define what rule sets will be applied
$containerConfigurator->import(SetList::DEAD_CODE);
$containerConfigurator->import(SetList::PHP_72);
$containerConfigurator->import(SetList::PHP_73);
$containerConfigurator->import(SetList::PHP_74);
$containerConfigurator->import(SetList::PHP_80);
$containerConfigurator->import(SetList::TYPE_DECLARATION_STRICT);
$containerConfigurator->import(SetList::TYPE_DECLARATION);
$containerConfigurator->import(SetList::EARLY_RETURN);
$containerConfigurator->import(SetList::PRIVATIZATION);


// get services (needed for register a single rule)
8 changes: 8 additions & 0 deletions tests/Unit/Scraper/Repositories/ConfigurationTest.php
@@ -5,6 +5,7 @@
use Illuminate\Foundation\Testing\DatabaseMigrations;
use Illuminate\Support\Facades\App;
use Illuminate\Support\Facades\Cache;
+ use Illuminate\Support\Facades\Log;
use Joskfg\LaravelIntelligentScraper\Scraper\Application\Configurator;
use Joskfg\LaravelIntelligentScraper\Scraper\Models\Configuration as ConfigurationModel;
use Joskfg\LaravelIntelligentScraper\Scraper\Models\ScrapedDataset;
@@ -14,6 +15,13 @@ class ConfigurationTest extends TestCase
{
use DatabaseMigrations;

+ public function setUp(): void
+ {
+ parent::setUp();
+
+ Log::spy();
+ }
+
/**
* @test
*/