Choose custom strategy to find best Favicon
AlexMili committed Dec 21, 2024
1 parent 2936c7d commit fae198b
Showing 6 changed files with 211 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -19,7 +19,9 @@ Key features include:
* **Availability Checks**: Validates each favicon’s URL, following redirects and marking icons as reachable or not.
* **DuckDuckGo Support**: Downloads Favicon directly from DuckDuckGo's public favicon API.
* **Google Support**: Downloads Favicon directly from Google's public favicon API.
* **Custom Strategy**: Sets the order in which the different available techniques are used to retrieve the best favicon.
* **Generate Favicon**: Generates a default SVG favicon when none is available.
* **Get Best Favicon**: Gets the best favicon available, generating one if none is found.
* **Async Support**: Offers asynchronous methods (via `asyncio`) to efficiently handle multiple favicon extractions concurrently, enhancing overall performance when dealing with numerous URLs.

## Installation
23 changes: 23 additions & 0 deletions docs/index.md
@@ -19,7 +19,9 @@ Key features include:
* **Availability Checks**: Validates each favicon’s URL, following redirects and marking icons as reachable or not.
* **DuckDuckGo Support**: Downloads Favicon directly from DuckDuckGo's public favicon API.
* **Google Support**: Downloads Favicon directly from Google's public favicon API.
* **Custom Strategy**: Sets the order in which the different available techniques are used to retrieve the best favicon.
* **Generate Favicon**: Generates a default SVG favicon when none is available.
* **Get Best Favicon**: Gets the best favicon available, generating one if none is found.
* **Async Support**: Offers asynchronous methods (via `asyncio`) to efficiently handle multiple favicon extractions concurrently, enhancing overall performance when dealing with numerous URLs.

## Installation
@@ -123,6 +125,27 @@ placeholder_favicon = generate_favicon("https://example.com")
print("Generated favicon URL:", placeholder_favicon.url)
```

### Get the Best Favicon Available

The `get_best_favicon` function tries multiple techniques in a specified order to find the best possible favicon. By default, the order is:

* `content`: Attempts to extract favicons from HTML or directly from the URL.
* `duckduckgo`: Fetches a favicon from DuckDuckGo if the first step fails.
* `google`: Retrieves a favicon from Google if the previous steps fail.
* `generate`: Generates a placeholder if no other method is successful.

The function returns the first valid favicon found, or `None` if no strategy in the list produces one.

```python
from extract_favicon import get_best_favicon

best_icon = get_best_favicon("https://example.com")

if best_icon:
    print("Best favicon URL:", best_icon.url)
    print("Favicon dimensions:", best_icon.width, "x", best_icon.height)
else:
    print("No valid favicon found for this URL.")
```
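
The ordered-fallback behaviour described above can be sketched without network access. The block below is a minimal illustration under stated assumptions, not the library's implementation: `from_content`, `from_search_engine`, `generate_placeholder`, `HANDLERS`, and `first_successful` are hypothetical stand-ins that return plain strings instead of `Favicon` objects.

```python
from typing import Callable, Optional


# Hypothetical stand-ins for the real strategies; each returns a URL or None.
def from_content(url: str) -> Optional[str]:
    return None  # pretend the page declares no icon


def from_search_engine(url: str) -> Optional[str]:
    return None  # pretend the remote favicon API found nothing


def generate_placeholder(url: str) -> Optional[str]:
    return f"generated:{url}"  # a placeholder can always be produced


# Assumed mapping from strategy name to handler (illustration only).
HANDLERS: dict[str, Callable[[str], Optional[str]]] = {
    "content": from_content,
    "duckduckgo": from_search_engine,
    "google": from_search_engine,
    "generate": generate_placeholder,
}


def first_successful(url: str, strategy: list[str]) -> Optional[str]:
    """Try each named strategy in order and return the first non-None result."""
    for name in strategy:
        result = HANDLERS[name.lower()](url)
        if result is not None:
            return result
    return None
```

In this sketch, a list that ends with `"generate"` always yields a result, while a list such as `["content"]` can yield `None`, which is why the caller in the example above still checks the return value.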

## Dependencies

When you install `extract_favicon` it comes with the following dependencies:
2 changes: 2 additions & 0 deletions src/extract_favicon/__init__.py
@@ -8,6 +8,7 @@
    from_html,
    from_url,
    generate_favicon,
    get_best_favicon,
    guess_missing_sizes,
)

@@ -20,6 +21,7 @@
    "from_html",
    "from_url",
    "generate_favicon",
    "get_best_favicon",
    "guess_missing_sizes",
]

85 changes: 85 additions & 0 deletions src/extract_favicon/main.py
@@ -453,3 +453,88 @@ def generate_favicon(url: str) -> Favicon:
return favicon


def get_best_favicon(
    url: str,
    html: Optional[Union[str, bytes]] = None,
    client: Optional[Client] = None,
    strategy: list[str] = ["content", "duckduckgo", "google", "generate"],
) -> Optional[Favicon]:
    """
    Attempts to retrieve the best favicon for a given URL using multiple strategies.

    The function iterates over the specified strategies in order, stopping as soon as a valid
    favicon is found:

    - "content": Parses the provided HTML (if any) or fetches page content from the URL to
      extract favicons. It then guesses missing sizes, checks availability, and downloads
      the largest icon.
    - "duckduckgo": Retrieves a favicon from DuckDuckGo if the previous step fails.
    - "google": Retrieves a favicon from Google if the previous step fails.
    - "generate": Generates a placeholder favicon if all else fails.

    Args:
        url: The URL for which the favicon is being retrieved.
        html: Optional HTML content to parse. If not provided, the page content is retrieved
            from the URL.
        client: Optional HTTP client to use for network requests.
        strategy: A list of strategy names to attempt in sequence. Defaults to
            ["content", "duckduckgo", "google", "generate"].

    Returns:
        The best found favicon if successful, otherwise None.

    Raises:
        ValueError: If an unrecognized strategy name is encountered in the list.
    """
    favicon = None

    for strat in strategy:
        if strat.lower() not in STRATEGIES:
            raise ValueError(f"{strat} strategy not recognized. Aborting.")

        if strat.lower() == "content":
            favicons = set()

            if html is not None and len(html) > 0:
                favicons = from_html(
                    str(html), root_url=_get_root_url(url), include_fallbacks=True
                )
            else:
                favicons = from_url(url, include_fallbacks=True, client=client)

            favicons_data = guess_missing_sizes(favicons, load_base64_img=True)
            favicons_data = check_availability(favicons_data, client=client)

            favicons_data = download(favicons_data, mode="largest", client=client)

            if len(favicons_data) > 0:
                favicon = favicons_data[0]

        elif strat.lower() == "duckduckgo":
            fav = from_duckduckgo(url, client)

            if (
                fav.reachable is True
                and fav.valid is True
                and fav.width > 0
                and fav.height > 0
            ):
                favicon = fav

        elif strat.lower() == "google":
            fav = from_google(url, client)

            if (
                fav.reachable is True
                and fav.valid is True
                and fav.width > 0
                and fav.height > 0
            ):
                favicon = fav

        elif strat.lower() == "generate":
            favicon = generate_favicon(url)

        if favicon is not None:
            break

    return favicon
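
Two details of the function above may be worth a second look. A strategy name is only checked when the loop reaches it, so an unknown name placed after a strategy that succeeds is never reported; and `str(html)` applied to `bytes` yields the `b'...'` representation rather than decoded text. The block below is a minimal sketch of one way to normalize both arguments before the loop. `normalize_strategy` and `html_to_text` are hypothetical helpers that are not part of the commit, and the contents of `STRATEGIES` are assumed from the default argument.

```python
from typing import Optional, Sequence, Union

# Assumption: mirrors the module-level STRATEGIES constant referenced in the diff.
STRATEGIES = ("content", "duckduckgo", "google", "generate")


def normalize_strategy(strategy: Optional[Sequence[str]] = None) -> list[str]:
    """Return lowercase strategy names, rejecting unknown ones up front."""
    if strategy is None:
        # A None sentinel also avoids sharing a mutable default list.
        return list(STRATEGIES)

    names = [name.lower() for name in strategy]
    unknown = [name for name in names if name not in STRATEGIES]
    if unknown:
        raise ValueError(f"Unrecognized strategies: {', '.join(unknown)}")
    return names


def html_to_text(
    html: Optional[Union[str, bytes]], encoding: str = "utf-8"
) -> Optional[str]:
    """Decode bytes explicitly instead of relying on str(), which returns the repr."""
    if html is None:
        return None
    if isinstance(html, bytes):
        return html.decode(encoding, errors="replace")
    return html
```

The UTF-8 default is itself an assumption; a caller that knows the page encoding would pass it explicitly.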
61 changes: 61 additions & 0 deletions src/extract_favicon/main_async.py
@@ -238,3 +238,64 @@ async def check_availability(
await asyncio.sleep(sleep_time)

return favs


async def get_best_favicon(
    url: str,
    html: Optional[Union[str, bytes]] = None,
    client: Optional[AsyncClient] = None,
    strategy: list[str] = ["content", "duckduckgo", "google", "generate"],
) -> Optional[Favicon]:
    favicon = None

    for strat in strategy:
        if strat.lower() not in STRATEGIES:
            raise ValueError(f"{strat} strategy not recognized. Aborting.")

        if strat.lower() == "content":
            favicons = set()

            if html is not None and len(html) > 0:
                favicons = from_html(
                    str(html), root_url=_get_root_url(url), include_fallbacks=True
                )
            else:
                favicons = await from_url(url, include_fallbacks=True, client=client)

            favicons_data = await guess_missing_sizes(favicons, load_base64_img=True)
            favicons_data = await check_availability(favicons_data, client=client)

            favicons_data = await download(favicons_data, mode="largest", client=client)

            if len(favicons_data) > 0:
                favicon = favicons_data[0]

        elif strat.lower() == "duckduckgo":
            fav = await from_duckduckgo(url, client)

            if (
                fav.reachable is True
                and fav.valid is True
                and fav.width > 0
                and fav.height > 0
            ):
                favicon = fav

        elif strat.lower() == "google":
            fav = await from_google(url, client)

            if (
                fav.reachable is True
                and fav.valid is True
                and fav.width > 0
                and fav.height > 0
            ):
                favicon = fav

        elif strat.lower() == "generate":
            favicon = generate_favicon(url)

        if favicon is not None:
            break

    return favicon
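
Because the async variant is a coroutine, several URLs can be resolved concurrently with `asyncio.gather`. The block below is a minimal sketch of that pattern with a concurrency cap. `fake_get_best_favicon` is a hypothetical stand-in for the coroutine above so the example runs offline, and `fetch_all` and its `limit` parameter are illustrative names, not part of the library.

```python
import asyncio
from typing import Optional


async def fake_get_best_favicon(url: str) -> Optional[str]:
    """Hypothetical stand-in for the async get_best_favicon; does no network I/O."""
    await asyncio.sleep(0.01)  # simulate waiting on a request
    return f"{url}/favicon.ico"


async def fetch_all(urls: list[str], limit: int = 5) -> dict[str, Optional[str]]:
    """Resolve favicons for many URLs, running at most `limit` lookups at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Optional[str]:
        async with semaphore:
            return await fake_get_best_favicon(url)

    # gather preserves the order of its arguments, so zip pairs results correctly.
    results = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(zip(urls, results))


if __name__ == "__main__":
    print(asyncio.run(fetch_all(["https://a.example", "https://b.example"])))
```

With the real coroutine, the same shape would apply, though whether to share one `AsyncClient` across the calls is a design choice this sketch leaves open.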
38 changes: 38 additions & 0 deletions test/test_download.py
@@ -162,3 +162,41 @@ def test_generate_default():
assert favicon.height == 100


@pytest.mark.parametrize(
    "url,strategy,img_format,width,height",
    [
        (
            "https://www.trustlist.ai/",
            ["content", "duckduckgo", "google", "generate"],
            "png",
            300,
            300,
        ),
        ("https://www.trustlist.ai/", ["generate"], "svg", 100, 100),
        ("https://www.trustlist.ai/", ["duckduckgo"], "png", 300, 300),
        ("https://www.trustlist.ai/", ["google"], "png", 256, 256),
        (
            "https://somerandometld.trustlist.ai/",
            ["content", "duckduckgo", "google", "generate"],
            "svg",
            100,
            100,
        ),
    ],
    ids=[
        "Default strategy",
        "Gen first strat",
        "Duckduckgo first strat",
        "Google first strat",
        "Default strategy unknown domain",
    ],
)
def test_best_favicon(url, strategy, img_format, width, height):
    favicon = extract_favicon.get_best_favicon(url, strategy=strategy)

    assert favicon is not None
    assert favicon.format == img_format
    assert favicon.reachable is True
    assert favicon.valid is True
    assert favicon.width == width
    assert favicon.height == height
