This repository has been archived by the owner on Nov 28, 2023. It is now read-only.

RSC speed optimization #14

Open
FeeiCN opened this issue Apr 25, 2018 · 1 comment
Labels
enhancement New feature or request

Comments

Owner

FeeiCN commented Apr 25, 2018

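When enumerating subdomains of a domain with wildcard DNS resolution, the biggest speed bottleneck is not the network requests but the response similarity comparison.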

Python's difflib.SequenceMatcher offers three methods for comparing string similarity, ordered here from fastest to slowest:
real_quick_ratio (speed 4) > quick_ratio (speed 2) > ratio (speed 1)
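
A minimal sketch of how the three methods relate, using two short made-up strings rather than real responses. The difflib documentation describes `quick_ratio()` and `real_quick_ratio()` as upper bounds on `ratio()`, which is why they are cheaper but less precise.

```python
from difflib import SequenceMatcher

# Two short made-up strings, for illustration only (not real responses).
a = "<html><body><h1>Not Found</h1></body></html>"
b = "<html><body><h1>Welcome</h1><p>Hello</p></body></html>"

matcher = SequenceMatcher(None, a, b)

exact = matcher.ratio()                  # full matching-block computation, slowest
quick = matcher.quick_ratio()            # upper bound from per-character counts
real_quick = matcher.real_quick_ratio()  # upper bound from the two lengths only

print(exact, quick, real_quick)

# real_quick_ratio() depends only on the lengths of the two sequences.
length_bound = 2.0 * min(len(a), len(b)) / (len(a) + len(b))
print(length_bound)
```

Because `real_quick_ratio()` looks only at lengths, two unrelated pages of similar size score close to 1.0, so it is best treated as a cheap filter rather than a similarity measure.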

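Even with the fastest of them, real_quick_ratio, comparing strings locally runs at under 50 comparisons per second.
So even if network request time is ignored entirely, the response similarity comparison alone for 170,000 subdomains takes close to an hour.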
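
The estimate can be checked with a quick calculation that uses only the two figures above. Since 50/s is stated as an upper bound on the rate, the result is a lower bound on the total time.

```python
subdomains = 170_000   # number of subdomains, from the comment above
rate_per_second = 50   # upper bound on comparisons per second, from the comment above

seconds = subdomains / rate_per_second
minutes = seconds / 60

print(f"{seconds:.0f} s = {minutes:.1f} min")  # 3400 s = 56.7 min
```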

For now, it looks like the only option is to write a new page similarity algorithm from scratch.

@FeeiCN FeeiCN added the enhancement New feature or request label Apr 25, 2018
Owner Author

FeeiCN commented Apr 25, 2018

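Tested with feei-esd-1234.suning.com and www.suning.com.
There is a library dedicated to comparing HTML similarity (html-similarity), but it is not much faster, and because it compares by style or structure it may produce false positives.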

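import time
from difflib import SequenceMatcher

import requests  # assumption: the original did not show how a and b were obtained
from html_similarity import style_similarity, structural_similarity, similarity

# Assumption: a and b are the response bodies of the two test hosts above.
a = requests.get('http://feei-esd-1234.suning.com').text
b = requests.get('http://www.suning.com').text

times = 1000

# Enable one comparison at a time; the figures in the comments are the
# numbers recorded in the original comment for each method.
start_time = time.time()
for i in range(times):
    # Compare similarity by characters and phrases
    # 7/s 0.2468
    x = SequenceMatcher(None, a, b).real_quick_ratio()
    # Compare similarity by HTML structure and style
    # 15/s 0.2688
    # x = similarity(a, b)
    # Compare similarity by HTML style
    # 28/s 0.3863
    # x = style_similarity(a, b)
    # Compare similarity by HTML structure
    # 22/s 0.1512
    # x = structural_similarity(a, b)
    print(i, x)
t = time.time() - start_time
print(t)          # total elapsed seconds
print(times / t)  # comparisons per second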
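
A possible variant of the timing loop, sketched with assumptions: the helper names `benchmark` and `real_quick` are hypothetical, and the two strings are made-up stand-ins for the real response bodies. It uses `time.perf_counter()` and keeps `print` out of the timed loop so that console output does not count toward the measured rate.

```python
import time
from difflib import SequenceMatcher


def benchmark(compare, a, b, times=1000):
    """Return (calls per second, last result) for compare(a, b)."""
    start = time.perf_counter()
    result = None
    for _ in range(times):
        result = compare(a, b)
    elapsed = time.perf_counter() - start
    return times / elapsed, result


def real_quick(a, b):
    # Builds a new SequenceMatcher on every call, as the loop above does.
    return SequenceMatcher(None, a, b).real_quick_ratio()


# Made-up stand-ins for the two response bodies.
a = "<html><body>" + "<p>item</p>" * 200 + "</body></html>"
b = "<html><body>" + "<div>other</div>" * 150 + "</body></html>"

rate, score = benchmark(real_quick, a, b, times=100)
print(f"{rate:.0f}/s, similarity {score:.4f}")
```

Since `real_quick_ratio()` itself only compares lengths, most of the time in this call is likely spent constructing the `SequenceMatcher`, which precomputes information about the second sequence. Passing `style_similarity` or `structural_similarity` as `compare` would time the html-similarity functions the same way.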
