[Bug] client.get_posts() returns no data for the ancient thread (上古老铁) tid=25084945 #213

Closed
Sorceresssis opened this issue Jul 19, 2024 · 8 comments

@Sorceresssis

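Brief description of the bug

On a whim I went looking for the thread with the most floors and tried crawling it, but requests for the thread tid=25084945 return no data.

How to reproduce

Describe the scenario and the steps that reproduce it

Expected behavior

...

Screenshots (optional)

...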

@lumina37 lumina37 added the bug Something isn't working label Jul 20, 2024
@lumina37
Owner

This is caused by ancient IP accounts having no user_id. I'll think about the best way to fix it.

@n0099

n0099 commented Jul 20, 2024

https://n0099.net/tbm/v1/client_tester.php?type=replies&tid=25084945&pn=1&client_version=12.62.1.0
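As I recall, ancient anonymous IP users had negative uids, and their portrait is the old version that predates #77 (comment): https://github.com/n0099/open-tbm/blob/b81b64a95c4bf84b9782173ffe5ff9fc23d0b729/c%23/crawler/src/Tieba/Crawl/Parser/UserParser.cs#L16-L18
I couldn't remember why, back in the 2019 PHP crawler era, there was handling for ancient anonymous IP users with Baidu uid 0 (n0099/open-tbm@84fcdaf). Oh, this is exactly the case it was trying to handle.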

$ curl -s https://n0099.net/tbm/v1/client_tester.php\?type\=replies\&tid\=25084945\&pn\=1\&client_version\=12.62.1.0 \
| jq '.post_list[] | select(.id == 222283034) | .author_id'
0

If there are several ancient anonymous IP users, every one of them has Baidu uid 0:

$ curl -s https://n0099.net/tbm/v1/client_tester.php\?type\=replies\&tid\=25084945\&pn\=1\&client_version\=12.62.1.0 \
| jq '.post_list[] | select(.author_id == 0) | .floor'
2819
3435
3437
3443
3444
3446
3447
3453
3454

.user_list only carries the info of the first Baidu uid 0 user to appear (the one with the lowest floor/pid?):

$ curl -s https://n0099.net/tbm/v1/client_tester.php\?type\=replies\&tid\=25084945\&pn\=1\&client_version\=12.62.1.0 \
| jq '.user_list[] | select(.id == 0)'
{
  "id": 0,
  "name_show": "220.190.163.*",
  "portrait": "00003232302e3139302e3136332e3230380000"
}
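
To make the collision concrete, here is a minimal Python sketch. It assumes the response has already been parsed into dicts shaped like the jq output above (`post_list[].floor`, `post_list[].author_id`, `user_list[].id`); `resolve_authors` is a hypothetical helper for illustration, not part of aiotieba.

```python
def resolve_authors(post_list, user_list):
    """Join posts to users by id, the way a flat user_list is meant to be used."""
    users_by_id = {user["id"]: user for user in user_list}
    return {post["floor"]: users_by_id.get(post["author_id"]) for post in post_list}


# Three of the uid 0 floors listed above, plus the single uid 0 entry in user_list.
post_list = [{"floor": floor, "author_id": 0} for floor in (2819, 3435, 3437)]
user_list = [{
    "id": 0,
    "name_show": "220.190.163.*",
    "portrait": "00003232302e3139302e3136332e3230380000",
}]

resolved = resolve_authors(post_list, user_list)
distinct_names = {user["name_show"] for user in resolved.values()}
print(distinct_names)
```

Every uid 0 floor resolves to the same single entry, whoever actually wrote it, so a join keyed on uid cannot tell these authors apart.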

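My suggestion: when an ancient anonymous IP user with Baidu uid 0 shows up, re-request with an ancient _client_version and take the user from the nested post.user structure (the `author` field in the output below):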

$ curl -s https://n0099.net/tbm/v1/client_tester.php\?type\=replies\&tid\=25084945\&pn\=1\&client_version\=6.0.2 \
| jq '.post_list[] | .author | select(.id < 1)'
{
  "id": 0,
  "name": "220.190.163.*",
  "name_show": "220.190.163.*",
  "portrait": "00003232302e3139302e3136332e3230380000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0
}
{
  "portrait": "00003231392e3134392e34362e37350000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0,
  "id": 0,
  "name": "219.149.46.*",
  "name_show": "219.149.46.*"
}
{
  "name_show": "221.194.106.*",
  "portrait": "00003232312e3139342e3130362e3138320000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0,
  "id": 0,
  "name": "221.194.106.*"
}
{
  "gender": 0,
  "id": 0,
  "name": "61.135.146.*",
  "name_show": "61.135.146.*",
  "portrait": "000036312e3133352e3134362e3234310000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": ""
}
{
  "gender": 0,
  "id": 0,
  "name": "222.76.156.*",
  "name_show": "222.76.156.*",
  "portrait": "00003232322e37362e3135362e3135370000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": ""
}
{
  "portrait": "00003231302e32392e3135372e33350000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0,
  "id": 0,
  "name": "210.29.157.*",
  "name_show": "210.29.157.*"
}
{
  "id": 0,
  "name": "210.29.157.*",
  "name_show": "210.29.157.*",
  "portrait": "00003231302e32392e3135372e33350000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0
}
{
  "portrait": "00003230322e39372e3134342e3233310000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0,
  "id": 0,
  "name": "202.97.144.*",
  "name_show": "202.97.144.*"
}
{
  "id": 0,
  "name": "202.97.144.*",
  "name_show": "202.97.144.*",
  "portrait": "00003230322e39372e3134342e3233310000",
  "level_id": 0,
  "is_bawu": 0,
  "bawu_type": "",
  "gender": 0
}
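
The conditional re-request suggested above could be sketched in Python roughly as follows. This is only an illustration under assumptions: `authors_by_post_id` and `fetch_legacy` are hypothetical names, the legacy response is assumed to carry a post `id` next to the nested `author`, and the HTTP call is replaced by canned data with made-up post ids.

```python
from typing import Callable

LEGACY_CLIENT_VERSION = "6.0.2"  # the old version used in the curl call above


def has_uid0_author(response: dict) -> bool:
    """True if any post in a modern-format response was written by uid 0."""
    return any(post.get("author_id") == 0 for post in response.get("post_list", []))


def authors_by_post_id(response: dict, fetch_legacy: Callable[[str], dict]) -> dict:
    """Map post id -> author dict, re-requesting only when uid 0 shows up."""
    users_by_id = {user["id"]: user for user in response.get("user_list", [])}
    authors = {post["id"]: users_by_id.get(post["author_id"])
               for post in response.get("post_list", [])}
    if not has_uid0_author(response):
        return authors  # common case: no extra request spent
    legacy = fetch_legacy(LEGACY_CLIENT_VERSION)
    for post in legacy.get("post_list", []):
        author = post.get("author") or {}
        # Same condition as the jq filter `select(.id < 1)`.
        if author.get("id", 0) < 1 and post.get("id") in authors:
            authors[post["id"]] = author
    return authors


# Usage with canned data instead of real HTTP calls (post ids 1 and 2 are made up).
modern = {
    "post_list": [{"id": 1, "author_id": 0}, {"id": 2, "author_id": 0}],
    "user_list": [{"id": 0, "name_show": "220.190.163.*"}],
}
legacy_calls = []


def fake_fetch_legacy(client_version: str) -> dict:
    legacy_calls.append(client_version)
    return {"post_list": [
        {"id": 1, "author": {"id": 0, "name_show": "220.190.163.*"}},
        {"id": 2, "author": {"id": 0, "name_show": "219.149.46.*"}},
    ]}


authors = authors_by_post_id(modern, fake_fetch_legacy)
```

The second request is only issued when a uid 0 author is present, so threads without ancient anonymous users cost a single request.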

@lumina37
Owner

Simply not filtering out users with user_id=0 is enough; c46b084 should have fixed it.

@n0099

n0099 commented Jul 21, 2024

Then you won't be able to get any Baidu uid 0 ancient anonymous IP user other than the first one to appear, 220.190.163.*.

@lumina37
Owner

Supporting that would require switching the version number, which is really a hassle.

@Sorceresssis
Author

Sorceresssis commented Jul 21, 2024

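I hadn't expected user_id to be absent. It looks like portrait should be preferred after all. post, FragAT and the like should also get a portrait field when they are saved, and queries should prefer portrait too.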

@n0099

n0099 commented Jul 22, 2024

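-I hadn't expected that user_id could be 0
+I hadn't expected user_id to be absent

First, in the context of protobuf encoding, the number 0 and the empty string are equivalent to absent (https://stackoverflow.com/questions/21227924/handling-null-values-in-protobuffers): a field whose value equals the default (0/"") is not encoded at all, which also means protobuf has no null.
That is why it needs a nullable monad, so that decoding can tell a default value from null and avoid the https://en.wikipedia.org/wiki/Semipredicate_problem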
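
A toy Python model of that point, with no protobuf library involved: `encode`/`decode` below are hypothetical stand-ins that only mimic the rule that default-valued fields are left off the wire, and the `*_with_presence` pair mimics a field that tracks whether it was explicitly set.

```python
DEFAULTS = {"user_id": 0, "name": ""}


def encode(message: dict) -> dict:
    """Toy encoder: fields equal to their default are not written."""
    return {key: value for key, value in message.items() if value != DEFAULTS[key]}


def decode(wire: dict) -> dict:
    """Toy decoder: fields missing from the wire come back as their default."""
    return {key: wire.get(key, default) for key, default in DEFAULTS.items()}


def encode_with_presence(message: dict) -> dict:
    """Toy encoder with presence tracking: every explicitly set field is written."""
    return dict(message)


def decode_with_presence(wire: dict) -> dict:
    """Toy nullable decoder: a field missing from the wire is reported as None."""
    return {key: wire.get(key) for key in DEFAULTS}


explicit_zero = {"user_id": 0, "name": "220.190.163.*"}
never_set = {"name": "220.190.163.*"}

# Without presence tracking the two messages decode to the same thing.
same_after_roundtrip = decode(encode(explicit_zero)) == decode(encode(never_set))

# With presence tracking, 0 and "not set" stay distinguishable.
zero_kept = decode_with_presence(encode_with_presence(explicit_zero))["user_id"]
absent_kept = decode_with_presence(encode_with_presence(never_set))["user_id"]
```

In the first roundtrip a uid of 0 and a missing uid are indistinguishable, which is the semipredicate problem; the nullable variant keeps them apart at the cost of writing the field even when it holds the default.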

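> It looks like portrait should be preferred after all

portrait has two versions; the portraits of the ancient anonymous IP users you see here are the first version.
#77 (comment)

Back in 2018, a certain mysterious person in the Tieba manager (贴吧管理器) group pointed out that, according to what he had read in Tieba's front-end JS, the old portrait was generated by concatenating the UTF-8 bytes of the Baidu username in reverse order several times.

If a third portrait version suddenly appears in the future, which version do you plan to treat as the https://en.wikipedia.org/wiki/Single_source_of_truth ? At least as of July 2024, Baidu uids have not changed (except that a few years ago allocation switched from roughly sequential auto-increment ids to ids assigned randomly starting from a large number, in an attempt to avoid #124 (comment)).

> post, FragAT and the like should also get a portrait field when they are saved

Then you would have to call yet another endpoint to look up the second-version portrait from the Baidu uid (#77 (comment)), moving from an id system known never to have changed to one known to have changed.

> queries should prefer portrait too

The bigger issue is that many endpoints don't accept a Baidu uid and only take the unstable Baidu username, the Tieba override ID (贴吧覆盖ID), or the v1/v2 portrait (#77 (comment)).

I was only poking fun at the fact that the user-page endpoint with the most complete information takes portrait as its parameter.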

@n0099

n0099 commented Jul 22, 2024

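> Supporting that would require switching the version number, which is really a hassle.

My tbm already supports issuing multiple requests for a single tid/pid/spid (https://github.com/n0099/open-tbm/blob/b81b64a95c4bf84b9782173ffe5ff9fc23d0b729/c%23/crawler/src/Tieba/Crawl/Crawler/ThreadCrawler.cs#L40-L47) and then merging them (https://github.com/search?q=repo%3An0099%2Fopen-tbm+ThreadClientVersion602&type=code).
But for these Baidu uid 0 ancient anonymous IP users, which only exist in content from before roughly 2010, I don't want to blindly burn the 10 rps budget (#82 (comment)) either; instead I would do it conditionally:

> when an ancient anonymous IP user with Baidu uid 0 shows up, re-request with an ancient version

What makes this more troublesome: because Baidu uid 0 ancient anonymous IP users exist (as mentioned, I recall they used to have distinct negative Baidu uids, possibly derived from the decimal IPv4 address, see indutny/node-ip#136 (comment) and indutny/node-ip#138 (comment)), their Baidu uids are all the same 0, so they need to be modeled and stored separately, as was recently done for a thread's latest replier (#208): n0099/open-tbm@c8f4920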
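
One way to picture "modeled and stored separately" is the Python sketch below. It is an illustration only: `UserStore` and `AnonymousIpUser` are hypothetical names, and keying anonymous users by their first-version portrait is an assumption made here because the uid is always 0, not a description of the open-tbm schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AnonymousIpUser:
    portrait: str   # first-version portrait, used as the key (assumption)
    name_show: str  # masked IP such as "210.29.157.*"


@dataclass
class UserStore:
    by_uid: dict = field(default_factory=dict)
    anonymous_by_portrait: dict = field(default_factory=dict)

    def save(self, author: dict) -> None:
        """Route uid < 1 authors to their own table instead of the uid-keyed one."""
        if author.get("id", 0) < 1:
            user = AnonymousIpUser(author["portrait"], author["name_show"])
            self.anonymous_by_portrait[user.portrait] = user
        else:
            self.by_uid[author["id"]] = author


# Authors copied from the legacy response earlier in the thread (one appears twice).
store = UserStore()
for author in [
    {"id": 0, "name_show": "220.190.163.*", "portrait": "00003232302e3139302e3136332e3230380000"},
    {"id": 0, "name_show": "210.29.157.*", "portrait": "00003231302e32392e3135372e33350000"},
    {"id": 0, "name_show": "210.29.157.*", "portrait": "00003231302e32392e3135372e33350000"},
]:
    store.save(author)
```

Keeping uid 0 authors out of the uid-keyed table means a repeated 0 can never overwrite another user there, while repeat posts by the same anonymous author still collapse into one record.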
