Basic information
Spider overview
1. In the mobile DingTalk app, search for 钉钉企典 (DingTalk Qidian) and open its search page, then capture the traffic: root the phone and install the Charles certificate into the system root certificate store.
2. Search by company_name; the response contains the search results plus the recommended results, usually about 20 entries.
3. The main goal of this collection is the phone numbers and their labels, especially bidding/tendering phone numbers.
task_result description
task_result=1000: everything found via search or recommendation is marked 1000
task_result=1101: nothing at all was found (a mapping sketch follows)
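A minimal sketch of this mapping, assuming the combined search + recommended results have already been parsed into a Python list (the function name is illustrative, not the actual spider code):

```python
def resolve_task_result(results: list) -> int:
    """Map the parsed search + recommended results to a task_result code.

    1000 - at least one result was returned
    1101 - nothing was found for the company
    """
    return 1000 if results else 1101
```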
Field description
Special notes on status codes
1000: data retrieved normally
1101: no results found at all
Data name (Chinese)
钉钉企典
Data name (English)
dingdingqidian
Collection site (entry point)
Search URL captured from the app:
https://holmes.taobao.com/ding/corp/customer/searchWithSummary
This is a POST request (a hedged replay sketch follows).
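A minimal replay sketch using `requests`, under the assumption that the endpoint accepts a JSON body; the real headers, cookies, signature fields and the payload field name all have to be taken from the Charles capture:

```python
from typing import Optional

import requests

SEARCH_URL = "https://holmes.taobao.com/ding/corp/customer/searchWithSummary"


def search_company(company_name: str, proxies: Optional[dict] = None) -> dict:
    """Replay the captured search request for one company name.

    The real request also needs the headers, cookies and signature fields
    observed in the Charles capture; the payload key "keyword" below is a
    placeholder, not a confirmed field name.
    """
    payload = {"keyword": company_name}  # placeholder field name
    resp = requests.post(SEARCH_URL, json=payload, proxies=proxies, timeout=10)
    resp.raise_for_status()
    return resp.json()
```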
Collection frequency and strategy
Stock (full-inventory) update strategy
First iterate over all bidding/tendering companies once (one-off job)
Routinely iterate over all companies (long-term job)
Incremental collection strategy
Owner
袁波
Spider name
dingdingqidian
Code location
Project URL:
http://office.pingansec.com:30080/granite/project-gravel/-/tree/dingdingqidian_20220311_2
## Queue name and address
<!-- redis host, port, db, key, and priority notes -->
* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key:
* dingdingqidian:10
Search - task input parameters (sample)
{
    "credit_no": "91530302397409612J",
    "company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
    "company_name": "曲靖市麒麟区乐友家庭农场",
    "company_code": "530302200120172"
}
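A sketch of pushing such a task onto the Redis queue described above, assuming tasks are JSON strings LPUSHed onto the key and that the `dingdingqidian:10` entry means key name plus priority suffix (an assumption about the scheduler's convention, not confirmed here):

```python
import json

import redis

# Connection details from the queue section above.  The password contains an
# '@', so it is passed explicitly instead of being embedded in a URL.
r = redis.Redis(
    host="bdp-mq-001.redis.rds.aliyuncs.com",
    port=6379,
    db=7,
    password="utn@0818",
)

task = {
    "credit_no": "91530302397409612J",
    "company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
    "company_name": "曲靖市麒麟区乐友家庭农场",
    "company_code": "530302200120172",
}

# Assumption: the scheduler pops JSON task strings from the "dingdingqidian:10"
# key, where the ":10" suffix encodes the queue priority.
r.lpush("dingdingqidian:10", json.dumps(task, ensure_ascii=False))
```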
data_type description
Actual data structure of the spider result
{
"data": [{
"companyName": "山西优倍房地产经纪有限公司",
"companyNameWithSummary": "山西优倍房地产经纪<mark>有限公司</mark>",
"legalPerson": "李华荣",
"registerDate": "2021-03-30",
"address": "山西省太原市小店区亲贤北街50号3幢6层0601、0602号(入驻太原市昊成商务秘书服务有限公司2021一B004)",
"ocid": "1210331000227664331",
"lon": 112.570799,
"lat": 37.823001,
"registerCapital": "500万人民币",
"logo": null,
"staffSize": null,
"tags": ["法人企业"],
"bizStatus": "在营",
"bAddMonitor": false,
"bFavorite": false,
"bTraced": null,
"industry": "房地产业",
"industryLevel1": null,
"industryLevel2": null,
"phone": "15803402644",
"email": "",
"garden": null,
"brands": null,
"officialSite": "",
"finDate": null,
"finType": null,
"finAmount": null,
"investor": null,
"flushTime": 0,
"taxNum": "91140100MA0LHJHU45",
"socialCreditCode": "91140100MA0LHJHU45",
"hitSummary": true,
"summaryFieldName": "注册地址",
"summaryFieldValue": "山西省太原...<mark>太原市昊成</mark>商务...",
"identityId": "520201007157553816",
"latestBiddingItem": null,
"biddingCnt": null,
"parkId": null,
"maxConnectionTelephone": null,
"totalTelephoneNum": null,
"telephoneDetailLists": [{
"telephone": "15803402644",
"connectRate": "低",
"sameTelephoneCrops": "18",
"sameTelephoneCropsOcid": ["1210422000337127975", "1200821000005032220", "1210907000181543642", "1200918000067161876", "1210428000345690559", "1210808000063010960", "1210907000181490512", "1210329000216853740", "1210708000081884615", "1210831000107056386", "1210607000161307947", "1210528000191065063", "1210821000072711207", "1210801000105507488", "1190916008383194349", "1210528000191069728", "1210618000157424216", "1210423000484915186"],
"specialTags": null,
"telephoneSource": "其他"
}],
"entBrief": null,
"emailList": null,
"officialSiteList": null,
"isMobilephone": null,
"regCity": "<mark>太原市</mark>",
"regProvince": "山西省",
"regArea": "小店区",
"opScope": "建设<mark>工程</mark>(<mark>建筑</mark>施工:<mark>建筑</mark>...",
"isStrictDingOrg": false,
"companyPhoneInfo": null
}],
"http_code": 200,
"error_msg": "",
"task_result": 1000,
"data_type": "detail",
"spider_start_time": "2022-03-15 16:56:34.100",
"spider_end_time": "2022-03-15 16:56:37.333",
"task_params": {
"credit_no": "91140108MA0KN8K93A",
"company_name_digest": "0043d184afba75246a8080ebe9b6a801",
"company_is_sme": 1,
"company_name": "太原市昊成建筑工程有限公司",
"company_code": "140108065078963",
"_id": "0043d184afba75246a8080ebe9b6a801"
},
"metadata": {},
"spider_name": "dingdingqidian",
"spider_ip": "10.8.6.4",
"proxy_ip": "http://10.8.6.219:1805"
}
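Since the point of the collection is the phone numbers and their labels, here is a minimal extraction sketch over a result record like the one above; field names follow the sample, and the helper itself is illustrative rather than part of the spider:

```python
def extract_phones(record: dict) -> list:
    """Collect phone numbers and their labels from one company record.

    Field names follow the sample result above; entries under
    "telephoneDetailLists" carry the number, connect rate, source and any
    special tags (e.g. bidding-related labels).
    """
    phones = []
    for item in record.get("telephoneDetailLists") or []:
        phones.append({
            "telephone": item.get("telephone"),
            "connect_rate": item.get("connectRate"),
            "source": item.get("telephoneSource"),
            "special_tags": item.get("specialTags") or [],
            "shared_company_count": item.get("sameTelephoneCrops"),
        })
    return phones


# For the sample record above this yields one entry:
# 15803402644, connect rate "低", source "其他", no special tags.
```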
Spider runtime environment
scrapy
Spider deployment info
collie user / machine for the crontab job: to be added
Spider deployment machine: 10.8.6.4, 10 processes
Taskhub address
Taskhub scheduling rules
Spider monitoring metrics design
Directory of spider results awaiting collection
/data/gravel_spiders/dingdingqidian
Data aggregation
Owner
范召贤
Data aggregation method
- Spider writes directly to Kafka
- Spider writes files, collected by logstash
Spider result directory
Post-aggregation storage directory
/data2_227/grvael_spider_result/dingdingqidian
logstash config file names
project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the Kafka topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael/grvael_spider_to_es.conf (writes to ES)
logstash file collection type
type=>"dingdingqidian"
Data aggregation topic
topic_id => "general-taxpayer"
ES log index and filter conditions
index => "gravel-spider-data-%{log_date}"
Monitoring dashboard
Data retention policy
Data cleaning
Owner
Code location
Deployment location
Deployment method and notes
- crontab + data_pump
- supervisor + data_pump
- supervisor + consumer
Data input source
Aggregated files