photonsquid/Recoinize-scraper

Coins parser

A simple parser for euro coin images.

Context

For a deep learning project, we need a data set covering every euro coin. The main idea is to generate many different photographs in Blender (varying lighting, camera angles, backgrounds, focus distances, etc.). To do this, we need to build every coin dynamically in Blender, which requires textures for all of them. This scraper collects those images.

The sum of the monetary values of all euro coins is €137.92. The sum of the commemorative €2 coins is €872. This does not account for their real value on the numismatic market, where some coins can be worth several hundred euros on their own.

Description

This script can scrape image URLs from different websites, and download them. Some scrapers are already implemented, but you can easily add your own (see Add a scraper).

Two scrapers are currently implemented; see ./src/scrapers/.

Images are downloaded in the following folder structure:

{root}/coins/{scraperDetails}/{countryCode}_{value}_{particularity}.{imageExtension}

Where:

  • {root} is the root folder given as argument (cf. here)
  • {countryCode} is the two-letter country code of the coin (e.g. fr for France, ad for Andorra, etc.)
  • {scraperDetails} can be null, depending on the scraper settings (cf. Add a scraper); if it is null, images are downloaded directly into {root}/coins/
  • {value} is the value of the coin (e.g. 1euro, 2cents, etc.), cf. this list
  • {particularity} is the particularity of the coin (e.g. 2019, 2018, 2017, etc.), or null if there is only one coin for this country (cf. here)
  • {imageExtension} is the extension of the image (jpg, png, etc.) as served by the scraped website.

Examples of images:

  • ./coins/va_50cents_2017.jpg
  • ./coins/lv_1euro.jpg
  • ./coins/va_10cents_SedeVacante.jpg
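The naming scheme above can be sketched as a small helper. This is only an illustration: `build_image_path` and its parameter names are hypothetical and are not taken from the project's source. It assumes that a null particularity is simply left out of the file name, as in the `lv_1euro.jpg` example.

```python
from pathlib import PurePosixPath
from typing import Optional


def build_image_path(
    root: str,
    country_code: str,
    value: str,
    image_extension: str,
    particularity: Optional[str] = None,
    scraper_details: Optional[str] = None,
) -> PurePosixPath:
    """Assemble the destination path for one coin image (hypothetical helper)."""
    # Start from {root}/coins and add the optional per-scraper subfolder.
    folder = PurePosixPath(root) / "coins"
    if scraper_details:
        folder = folder / scraper_details

    # Assumption: a null particularity is left out of the file name.
    parts = [country_code, value]
    if particularity:
        parts.append(particularity)

    # Accept both "jpg" and ".jpg" for the extension.
    extension = image_extension.lstrip(".")
    return folder / f"{'_'.join(parts)}.{extension}"


print(build_image_path("images", "va", "50cents", "jpg", "2017"))
# images/coins/va_50cents_2017.jpg
print(build_image_path("images", "lv", "1euro", ".jpg"))
# images/coins/lv_1euro.jpg
```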

Installation

The installation depends on which part of the script you want to run. If you already have the JSON files and don't need the scraping part, you can simply run:

pip install -r requirements-downloader.txt

Otherwise, you need to install Playwright:

pip install -r requirements-scraper.txt

requirements-scraper.txt includes requirements-downloader.txt, so you don't need to install both.

Usage

If, for example, you want to provide the root folder argument, you can do it like this:

python ./main.py -r ./images

Arguments

This is the list of different arguments that can be passed to the command.

| short | long | default | short description |
|-------|------|---------|-------------------|
| -r | --root | "./images" | root folder where images will be downloaded, and JSON file created |
| -s | --scrape | false | if true, no image download, only JSON scraping |
| -h | --help | | display help |
| -d | --debug | false | if true, debug mode (lot of logs) |
| -q | --quiet | false | if true, quiet mode (no output) |

-r, --root

This is the root folder that contains the JSON file. All images will be downloaded to {root}/coins/. If this argument is not provided, the default value is read from src/constants.py (cf. DEFAULT_ROOT_FOLDER).

-s, --scrape

If true, no image download, only JSON scraping. If a JSON file ({root}/{json}) already exists, it will be overwritten.
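For reference, the options listed above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the project's actual main.py: the help strings are paraphrased, and the default root is hard-coded here instead of being read from src/constants.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (a sketch, not main.py itself)."""
    parser = argparse.ArgumentParser(description="Scrape and download euro coin images.")
    # -h/--help is added automatically by argparse.
    # Assumption: the default is hard-coded here; the README says the real
    # default comes from DEFAULT_ROOT_FOLDER in src/constants.py.
    parser.add_argument("-r", "--root", default="./images",
                        help="root folder for the JSON file and downloaded images")
    parser.add_argument("-s", "--scrape", action="store_true",
                        help="only scrape to JSON, do not download images")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="debug mode (verbose logs)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="quiet mode (no output)")
    return parser


args = build_parser().parse_args(["-r", "./out", "-s"])
print(args.root, args.scrape, args.debug, args.quiet)
# ./out True False False
```

Using `action="store_true"` matches the "false by default, true when the flag is present" behaviour described in the table.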

Add a scraper

I didn't have time to write docs about this, but you can see examples in ./src/scrapers/.

Your classes have to inherit from Scraper and implement the scraper method. The constructor has to have this line: super().__init__(self, args, logger, NAME, BASE_URL), where:

  • args is the arguments passed to the command (cf. here)
  • logger is the logger from the app (main.py)
  • NAME, name of the scraper. This is compulsory, and will be used to create the JSON file.
  • BASE_URL, base URL of the website.

The method has to return a dictionary with the following keys:

| key | type | description |
|-----|------|-------------|
| countryCode | string | country code in two letters, cf. this list |
| value | string | value of the coin, cf. this list |
| url | string | URL of the coin |
| particularity | string or null | particularity of the coin, cf. this list |
| imageExtension | string or null | extension of the image |
| special_path | string or null | special path of the image |

The special path is resolved inside the root folder. If it is null (default), the image is downloaded to {root}/coins/. If it is not null, the image is downloaded to {root}/coins/{special_path}/.
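As an illustration of the return value described above, the dictionary for one image could be assembled like this. The `CoinRecord` class is hypothetical and not part of the project; only the key names come from the table, and the URL is the one used in this README's JSON example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoinRecord:
    """Hypothetical container for the values a scraper returns for one image."""
    country_code: str
    value: str
    url: str
    particularity: Optional[str] = None
    image_extension: Optional[str] = None
    special_path: Optional[str] = None

    def to_dict(self) -> dict:
        # Key names follow the documented table, including the
        # snake_case special_path key.
        return {
            "countryCode": self.country_code,
            "value": self.value,
            "url": self.url,
            "particularity": self.particularity,
            "imageExtension": self.image_extension,
            "special_path": self.special_path,
        }


record = CoinRecord(
    country_code="va",
    value="50cents",
    url="https://www.ecb.europa.eu/euro/coins/html/va/50c_2017.en.html",
    particularity="2017",
    image_extension="jpg",
).to_dict()
print(record["countryCode"], record["particularity"], record["special_path"])
# va 2017 None
```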

JSON file

When data is scraped, a new JSON file is created containing data for each image. It must respect the schema defined in ./schema.json.

Example of JSON file:

[
    {
        "countryCode": "va",
        "value": "50cents",
        "particularity": "2017",
        "url": "https://www.ecb.europa.eu/euro/coins/html/va/50c_2017.en.html"
    }
]

Where:

| field | type | description |
|-------|------|-------------|
| countryCode | string | country code in two letters, cf. this list |
| value | string | value of the coin, cf. this list |
| particularity | string or null | particularity of the coin, cf. here |
| url | uri (string) | URL of the scraped website |
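A quick way to sanity-check such a file against the field types listed above is sketched below, using only the standard library. It does not read ./schema.json, whose real rules may be stricter; `check_entries` is a hypothetical helper name.

```python
import json

REQUIRED_STRING_FIELDS = ("countryCode", "value", "url")


def check_entries(raw_json: str) -> list:
    """Parse JSON text and check each entry against the documented field types."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of coin entries")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index}: expected an object")
        for field in REQUIRED_STRING_FIELDS:
            if not isinstance(entry.get(field), str):
                raise ValueError(f"entry {index}: '{field}' must be a string")
        # particularity may be a string or null (None after parsing).
        particularity = entry.get("particularity")
        if particularity is not None and not isinstance(particularity, str):
            raise ValueError(f"entry {index}: 'particularity' must be a string or null")
    return entries


sample = """
[
    {
        "countryCode": "va",
        "value": "50cents",
        "particularity": "2017",
        "url": "https://www.ecb.europa.eu/euro/coins/html/va/50c_2017.en.html"
    }
]
"""
print(len(check_entries(sample)))
# 1
```

Note that `json.loads` rejects a trailing comma after the last array element, so a file with one fails at the parsing step before any field check runs.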

Special coins

Some countries have had a single coin design since 2002, while others have several. For example, Vatican City has a new coin for each pope. For these countries, the particularity is saved. It can be a year or a name, and it is null if there is only one coin.

List of countries

You can find the list of countries here.

List of coins

You can find the list of all euro coins here.

Regular coins

coin values
2euro
1euro
50cents
20cents
10cents
5cents
2cents
1cent
