Last edited by 倾尽天下 Mar 17, 2022

qidian

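Basic information

Crawler overview

1. In the DingTalk mobile app, search for 钉钉企典 (DingTalk Qidian) and open its search page, then capture the traffic. This requires a rooted phone with the Charles certificate imported into the root directory.
2. Search by company_name to get the search results plus the recommended results, usually 20 entries in total.
3. The main goal of this crawl is to collect phone numbers and the tags attached to them, especially phone numbers related to bidding and tendering (招投标).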

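task_result description

task_result=1000: every record returned, whether a search hit or a recommendation, is marked 1000
task_result=1101: the search returned nothing at all

Field description

Status code notes

1000 Data retrieved normally
1101 No search results found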
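
The two status codes above can be assigned with a single check on the parsed response. The following is a minimal sketch, not the crawler's actual code; the constant and function names are hypothetical.

```python
from typing import Any, Dict, List

# Status codes as documented above; the constant names are hypothetical.
TASK_RESULT_OK = 1000       # search hits or recommendations were returned
TASK_RESULT_NO_DATA = 1101  # the search returned nothing at all


def classify_task_result(records: List[Dict[str, Any]]) -> int:
    """Return the task_result code for the records of one search response."""
    return TASK_RESULT_OK if records else TASK_RESULT_NO_DATA
```

Because search hits and recommendations share the code 1000, the code alone does not tell whether the queried company itself was matched.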

Data name (Chinese)

钉钉企典

Data name (English)

dingdingqidian

Source site (crawl entry point)

Search URL taken from the captured app traffic:
https://holmes.taobao.com/ding/corp/customer/searchWithSummary
It is a POST request.
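
As a rough illustration of the POST call, the sketch below builds (but does not send) a request to the search URL with the standard library. The body field `keyword` and the JSON content type are assumptions; this page does not document the request body, so take the real body and headers from the captured Charles session. The function name is hypothetical.

```python
import json
import urllib.request

SEARCH_URL = "https://holmes.taobao.com/ding/corp/customer/searchWithSummary"


def build_search_request(company_name: str) -> urllib.request.Request:
    """Build the POST request for one company search without sending it."""
    # ASSUMPTION: "keyword" is a placeholder field name, not the documented API.
    body = json.dumps({"keyword": company_name}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Inside the Scrapy spider the same request would normally be expressed as a `scrapy.Request` with `method="POST"` and the encoded body.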

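Crawl frequency and strategy

Full-refresh strategy for existing data

First, traverse the bidding and tendering (招投标) companies once (one-off job)
Then traverse all companies on a routine basis (long-term job)

Incremental crawl strategy


Owner

袁波

Crawler name

dingdingqidian

Code location

Project URL: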
http://office.pingansec.com:30080/granite/project-gravel/-/tree/dingdingqidian_20220311_2

## Queue name and queue address
<!-- redis host, port, db, key, and priority notes -->
* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key: 
    * dingdingqidian:10

Search: task input parameters (sample)

{
	"credit_no": "91530302397409612J",
	"company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
	"company_name": "曲靖市麒麟区乐友家庭农场",
	"company_code": "530302200120172"
}
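
Before a task is handed to the spider, it can be parsed and checked against the fields shown in the sample. This is a sketch under one assumption: that all four sample fields are mandatory, which this page does not state. The names `REQUIRED_FIELDS` and `parse_task` are hypothetical.

```python
import json
from typing import Any, Dict

# ASSUMPTION: the four fields of the sample task are all required.
REQUIRED_FIELDS = ("credit_no", "company_name_digest", "company_name", "company_code")


def parse_task(raw: str) -> Dict[str, Any]:
    """Parse one queued task and reject it if a sample field is missing or empty."""
    task = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    missing = [name for name in REQUIRED_FIELDS if not task.get(name)]
    if missing:
        raise ValueError("task is missing fields: " + ", ".join(missing))
    return task
```

Only `company_name` is used for the search itself; the other fields travel with the task and reappear under `task_params` in the result.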

data_type description

Data structure of an actual crawler result

{
	"data": [{
		"companyName": "山西优倍房地产经纪有限公司",
		"companyNameWithSummary": "山西优倍房地产经纪&lt;mark&gt;有限公司&lt;/mark&gt;",
		"legalPerson": "李华荣",
		"registerDate": "2021-03-30",
		"address": "山西省太原市小店区亲贤北街50号3幢6层0601、0602号(入驻太原市昊成商务秘书服务有限公司2021一B004)",
		"ocid": "1210331000227664331",
		"lon": 112.570799,
		"lat": 37.823001,
		"registerCapital": "500万人民币",
		"logo": null,
		"staffSize": null,
		"tags": ["法人企业"],
		"bizStatus": "在营",
		"bAddMonitor": false,
		"bFavorite": false,
		"bTraced": null,
		"industry": "房地产业",
		"industryLevel1": null,
		"industryLevel2": null,
		"phone": "15803402644",
		"email": "",
		"garden": null,
		"brands": null,
		"officialSite": "",
		"finDate": null,
		"finType": null,
		"finAmount": null,
		"investor": null,
		"flushTime": 0,
		"taxNum": "91140100MA0LHJHU45",
		"socialCreditCode": "91140100MA0LHJHU45",
		"hitSummary": true,
		"summaryFieldName": "注册地址",
		"summaryFieldValue": "山西省太原...&lt;mark&gt;太原市昊成&lt;/mark&gt;商务...",
		"identityId": "520201007157553816",
		"latestBiddingItem": null,
		"biddingCnt": null,
		"parkId": null,
		"maxConnectionTelephone": null,
		"totalTelephoneNum": null,
		"telephoneDetailLists": [{
			"telephone": "15803402644",
			"connectRate": "低",
			"sameTelephoneCrops": "18",
			"sameTelephoneCropsOcid": ["1210422000337127975", "1200821000005032220", "1210907000181543642", "1200918000067161876", "1210428000345690559", "1210808000063010960", "1210907000181490512", "1210329000216853740", "1210708000081884615", "1210831000107056386", "1210607000161307947", "1210528000191065063", "1210821000072711207", "1210801000105507488", "1190916008383194349", "1210528000191069728", "1210618000157424216", "1210423000484915186"],
			"specialTags": null,
			"telephoneSource": "其他"
		}],
		"entBrief": null,
		"emailList": null,
		"officialSiteList": null,
		"isMobilephone": null,
		"regCity": "&lt;mark&gt;太原市&lt;/mark&gt;",
		"regProvince": "山西省",
		"regArea": "小店区",
		"opScope": "建设&lt;mark&gt;工程&lt;/mark&gt;(&lt;mark&gt;建筑&lt;/mark&gt;施工:&lt;mark&gt;建筑&lt;/mark&gt;...",
		"isStrictDingOrg": false,
		"companyPhoneInfo": null
	}],
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2022-03-15 16:56:34.100",
	"spider_end_time": "2022-03-15 16:56:37.333",
	"task_params": {
		"credit_no": "91140108MA0KN8K93A",
		"company_name_digest": "0043d184afba75246a8080ebe9b6a801",
		"company_is_sme": 1,
		"company_name": "太原市昊成建筑工程有限公司",
		"company_code": "140108065078963",
		"_id": "0043d184afba75246a8080ebe9b6a801"
	},
	"metadata": {},
	"spider_name": "dingdingqidian",
	"spider_ip": "10.8.6.4",
	"proxy_ip": "http://10.8.6.219:1805"
}
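
Since the crawl targets phone numbers and their tags, a downstream consumer mostly needs the `telephoneDetailLists` entries of each company. The sketch below flattens one result into one row per phone number; it relies only on field names visible in the sample above, and the function name and output keys are hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_phone_rows(result: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per phone number found in a crawler result."""
    # "data" and "telephoneDetailLists" may be null, so fall back to an empty list.
    for company in result.get("data") or []:
        for detail in company.get("telephoneDetailLists") or []:
            yield {
                "company_name": company.get("companyName"),
                "social_credit_code": company.get("socialCreditCode"),
                "telephone": detail.get("telephone"),
                "special_tags": detail.get("specialTags"),  # passed through unchanged
                "telephone_source": detail.get("telephoneSource"),
                "connect_rate": detail.get("connectRate"),
            }
```

The returned companies include recommendations, so a row may belong to a company other than the one named in `task_params`; in the sample, the search was for 太原市昊成建筑工程有限公司 but the record is 山西优倍房地产经纪有限公司.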

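Crawler runtime environment

scrapy

Crawler deployment information

Machine and collie user for the crontab job: to be added
Crawler deployment machine: 10.8.6.4, 10 processes

Taskhub address

Taskhub scheduling rules

Crawler monitoring metrics design

Directory of crawler results awaiting collection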

/data/gravel_spiders/dingdingqidian

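Data aggregation

Owner

范召贤

Data aggregation method

  • Crawler writes directly to Kafka

  • Crawler writes files that logstash collects

Crawler result directory

Directory after aggregation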

/data2_227/grvael_spider_result/dingdingqidian

logstash configuration file names

project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the Kafka topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael/grvael_spider_to_es.conf (writes to ES)

logstash file collection type

type=>"dingdingqidian"

Topic for the aggregated data

topic_id => "general-taxpayer"

ES log index and filter conditions

index => "gravel-spider-data-%{log_date}"

Monitoring dashboard

Data retention policy


Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data input source

Aggregated files

Data storage table address
