Last edited by fanzx May 27, 2021

hub

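Basic information

Industry and commerce: license display service (亮照服务), Hubei
ic_icpsp_req/hub searches by the credit_no given in the submitted task parameters. If credit_no is empty, it first obtains the credit_no from the license display service using company_code or company_name, and then runs the query.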
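
The fallback described above can be sketched as follows. This is a minimal illustration, not the spider's actual code: `resolve_credit_no` and the `lookup` callable are hypothetical names, and trying company_code before company_name is an assumption, since the page does not state the order.

```python
from typing import Callable, Mapping, Optional


def resolve_credit_no(
    task: Mapping[str, str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the credit_no to search with, or None if none can be resolved.

    `lookup` stands in for a query against the license display service
    (hypothetical interface): it takes a company_code or company_name and
    returns a credit_no, or None when nothing is found.
    """
    credit_no = (task.get("credit_no") or "").strip()
    if credit_no:
        # credit_no was supplied, so no lookup is needed.
        return credit_no

    # Assumed order: company_code first, then company_name.
    for key in ("company_code", "company_name"):
        value = (task.get(key) or "").strip()
        if value:
            found = lookup(value)
            if found:
                return found
    return None
```

Passing the lookup in as a callable keeps the decision logic testable without network access.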

Data name (Chinese)

湖北--工商.亮照服务

Data name (English)

ic_icpsp

Collection site (entry point)

Official PC web entry:
http://wsdj.egs.gov.cn/ICPSP/queryEnt.action
Path where collected files are stored:
/data/gravel_spiders/ic_icpsp_spider

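Collection frequency and strategy

Full refresh strategy (existing records)

db_host: bdp-rds-007.mysql.rds.aliyuncs.com
db_name: utn_ic
db_user: shuidi
db_password: 
Database table: tb_search_company_icpsp
All registered entities in the province are used as search conditions.
Records are updated one at a time.
For now, a single full-refresh pass is sufficient.

Incremental collection strategy

1. Newly established entities
2. Backfilled entities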

Spider

Hubei industry and commerce license display service: ic_icpsp_spiders

Owner

郭本江

Spider name

ic_icpsp_spider

Code location

Project URL: http://192.168.109.110/granite/project-gravel/-/tree/develop_ic_icpsp/scrapy_spiders/gravel_spiders/spiders/ic_icpsp_reqs/hub

Queue name and queue address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • ic_icpsp_keys

Priority queue notes

  • ic_icpsp_keys supports queue priority
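
This page does not say how priority is implemented on ic_icpsp_keys. The sketch below assumes the queue is a Redis sorted set in which a lower score is consumed first, so priority is negated into the score. Both the storage type and the score convention are assumptions, and `enqueue_task` is a hypothetical helper. The client only needs a `zadd(name, mapping)` method, which matches the redis-py 3.x signature.

```python
import json
from typing import Any, Mapping

QUEUE_KEY = "ic_icpsp_keys"  # queue key listed on this page


def enqueue_task(client: Any, task: Mapping[str, Any], priority: int = 0) -> str:
    """Push one task onto the queue and return the serialized payload.

    Assumption: the queue is a sorted set and consumers pop the lowest
    score first, so a higher priority maps to a lower (negated) score.
    """
    # sort_keys makes the payload deterministic, so re-submitting the same
    # task updates its score instead of adding a duplicate member.
    payload = json.dumps(task, ensure_ascii=False, sort_keys=True)
    client.zadd(QUEUE_KEY, {payload: -priority})
    return payload
```

With redis-py, `client` would typically come from `redis.Redis.from_url(...)`; any object exposing a compatible `zadd` works for testing.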

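Task source

taskhub: the full set of registered entities in Hubei

Task input parameters (sample). The keys credit_no, province, company_name, and company_code are required; credit_no may be an empty string, as in the sample below.

破码版本 (captcha-solving version)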
{
   "company_name": "武汉鸿雁房地产代理有限公司",
   "company_code":  "4201002111621",
   "credit_no":  "",
   "province":  "HUB",
   "data_detail_url":  "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=E8D47B3680F92E8CE588BF508DCB88BE",
   "company_major_type":  "3",
   "company_status":  "已吊销",
   "legal_person":  "刘桂强",
   "company_address":  "武昌区紫阳路147号",
   "create_time":  "2016-11-09 07:00:45.194000+00:00",
   "lastupdatetime":  "2021-03-28 14:57:26+00:00",
   "submit_time":  "",
   "authority":  "武汉市工商行政管理局",
   "n_company_status":  "吊销",
   "company_name_digest":  "3c7e408bcaedd3a960f0948ad81cbb74",
   "district_code":  ""
}
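
A minimal sketch of checking a task against the required keys before it is queued. `validate_task` is a hypothetical helper. The code 8000 is the parameter-error value listed in the Taskhub scheduling rules later on this page. Requiring at least one non-empty identifier is an assumption inferred from the search logic, not a documented rule.

```python
from typing import Any, Mapping, Optional

REQUIRED_KEYS = ("credit_no", "province", "company_name", "company_code")
PARAM_ERROR = 8000  # task_result value for a parameter error on this page


def validate_task(task: Mapping[str, Any]) -> Optional[int]:
    """Return None when the task looks usable, otherwise PARAM_ERROR."""
    # All four keys must be present, even if some values are empty strings.
    if any(key not in task for key in REQUIRED_KEYS):
        return PARAM_ERROR

    # Assumption: the spider needs at least one non-empty identifier to search.
    identifiers = ("credit_no", "company_code", "company_name")
    if not any(str(task[key] or "").strip() for key in identifiers):
        return PARAM_ERROR
    return None
```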

Task sample


{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91420115MA49RG6T9L",
				"province": "HUB",
				"company_name": "武汉中肯商贸有限公司",
				"credit_no": "91420115MA49RG6T9L",
				"legal_person": "张耘",
				"establish_date": "2021-05-17",
				"detail_url": "/ICPSP/pdf.action?id=160000020925690093&lzcx=lzcx",
				"data_id": "160000020925690093",
				"company_address": "湖北省武汉市江夏区郑店街黄金南路8号汽车制造专用模具、夹具、检具生产线4号厂房",
				"company_type_code": "1152",
				"company_type": "有限责任公司(自然人投资或控股的法人独资)",
				"capital": 50,
				"company_code": "",
				"company_status": "",
				"legal_person_certno": "F3A10E8FD2D7882064EC67721F9DEBE9"
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-05-24 18:04:27.775",
	"spider_end_time": "2021-05-24 18:04:31",
	"task_params": {
		"province": "HUB",
		"credit_no": "91420115MA49RG6T9L",
		"company_code": "",
		"company_name": "武汉中肯商贸有限公司"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.0.26"
}

Task parameter description

破码版本 (captcha-solving version)
{
   "company_name": "武汉鸿雁房地产代理有限公司",
   "company_code":  "4201002111621",
   "credit_no":  "",
   "province":  "HUB",
   "data_detail_url":  "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=E8D47B3680F92E8CE588BF508DCB88BE",
   "company_major_type":  "3",
   "company_status":  "已吊销",
   "legal_person":  "刘桂强",
   "company_address":  "武昌区紫阳路147号",
   "create_time":  "2016-11-09 07:00:45.194000+00:00",
   "lastupdatetime":  "2021-03-28 14:57:26+00:00",
   "submit_time":  "",
   "authority":  "武汉市工商行政管理局",
   "n_company_status":  "吊销",
   "company_name_digest":  "3c7e408bcaedd3a960f0948ad81cbb74",
   "district_code":  ""
}

data_type description

detail: detail information

Annotated superset of the spider result

{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91420115MA49RG6T9L",  -- search keyword
				"province": "HUB",                       -- province, fixed to HUB here
				"company_name": "武汉中肯商贸有限公司",    -- company name
				"credit_no": "91420115MA49RG6T9L",       -- unified social credit code
				"legal_person": "张耘",                   -- legal representative name
				"establish_date": "2021-05-17",           -- date of establishment
				"detail_url": "/ICPSP/pdf.action?id=160000020925690093&lzcx=lzcx",  -- detail URL
				"data_id": "160000020925690093",          -- the id parameter of the detail URL
				"company_address": "湖北省武汉市江夏区郑店街黄金南路8号汽车制造专用模具、夹具、检具生产线4号厂房",  -- registered address
				"company_type_code": "1152",              -- company type code
				"company_type": "有限责任公司(自然人投资或控股的法人独资)", -- company type
				"capital": 50,                            -- registered capital
				"company_code": "",                       -- registration number
				"company_status": "",                     -- registration status
				"legal_person_certno": "F3A10E8FD2D7882064EC67721F9DEBE9" -- encrypted legal representative ID
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,        -- task result status: 1000 success, 1101 no result found, anything else: crawl failed
	"data_type": "detail",
	"spider_start_time": "2021-05-24 18:04:27.775",
	"spider_end_time": "2021-05-24 18:04:31",
	"task_params": {
		"province": "HUB",
		"credit_no": "91420115MA49RG6T9L",
		"company_code": "",
		"company_name": "武汉中肯商贸有限公司"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.0.26"
}
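
The annotation above says data_id is the id parameter of detail_url. A minimal sketch of that relationship using only the standard library; `data_id_from_detail_url` is a hypothetical helper name. Resolving the relative detail_url against the entry URL from this page is an assumption about which host serves it.

```python
from typing import Optional
from urllib.parse import parse_qs, urljoin, urlsplit

# Entry URL listed on this page; assumed to be the base for relative detail URLs.
ENTRY_URL = "http://wsdj.egs.gov.cn/ICPSP/queryEnt.action"


def data_id_from_detail_url(detail_url: str) -> Optional[str]:
    """Return the value of the `id` query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(detail_url).query).get("id")
    return values[0] if values else None


def absolute_detail_url(detail_url: str) -> str:
    """Resolve a relative detail_url against the entry URL (assumed base)."""
    return urljoin(ENTRY_URL, detail_url)
```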

Data structure of an actual spider result

{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91422822MA490LYE01",
				"province": "HUB",
				"company_address": "建始县业州镇团结路新马路",
				"company_name": "湖北文图档案服务有限公司",
				"company_type_code": "1151",
				"company_type": "有限责任公司(自然人独资)",
				"establish_date": "2017-07-17",
				"capital": 80,
				"company_code": "",
				"credit_no": "91422822MA490LYE01",
				"legal_person": "侯贤才",
				"company_status": "",
				"legal_person_certno": "5B031B5292CCAF1EE26CAD8B09820982"
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-05-25 09:24:57.669",
	"spider_end_time": "2021-05-25 09:25:00",
	"task_params": {
		"company_name": "湖北文图档案服务有限公司",
		"company_code": "422822000035850",
		"credit_no": "91422822MA490LYE01",
		"province": "HUB",
		"data_detail_url": "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=4F4A1397BB7B4B77A08DD1F6F5A1CE94",
		"company_major_type": "3",
		"company_status": "吊销,未注销",
		"legal_person": "侯贤才",
		"company_address": "建始县业州镇团结路新马路",
		"create_time": "2017-11-10 01:18:18+00:00",
		"lastupdatetime": "2021-03-23 00:44:22+00:00",
		"submit_time": "",
		"authority": "建始县市场监督管理局",
		"n_company_status": "吊销",
		"company_name_digest": "c1cda85ab5600d17b4b9f28784f41406",
		"district_code": "422822"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.6.51"
}
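
In the samples on this page, spider_start_time carries milliseconds while spider_end_time does not. The sketch below parses both forms and computes the elapsed time of a result. The function names are hypothetical, and the two formats are inferred from the samples only. Because the end time appears to be truncated to whole seconds, the result is approximate.

```python
from datetime import datetime
from typing import Any, Mapping


def parse_spider_time(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' with or without a fractional part."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in value else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(value, fmt)


def elapsed_seconds(result: Mapping[str, Any]) -> float:
    """Seconds between spider_start_time and spider_end_time of one result."""
    start = parse_spider_time(result["spider_start_time"])
    end = parse_spider_time(result["spider_end_time"])
    return (end - start).total_seconds()
```

For the result above this comes to roughly 2.3 seconds.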

Spider runtime environment

scrapy

Spider deployment information

ic_icpsp_spiders: 10.8.6.51, 35 processes

Taskhub addresses

Task submission URL: http://10.8.6.222:18518/task/
Code location: http://192.168.109.110/granite/project-gravel/blob/develop_app_10jqka_20210121/app_general_taxpayer/data_pump/general_taxpayer.yml

Taskhub scheduling rules

task_result=1000    # detail retrieved successfully
task_result=1101    # no result found
task_result=9101    # timeout error; the task must be retried, currently 5 times
task_result=8000    # parameter error
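
The rules above can be read as a small decision table. The sketch below is one possible encoding, not Taskhub's actual scheduler: the `Action` names and `next_action` are hypothetical, and treating any unlisted code as a failure follows the annotation on task_result earlier on this page.

```python
from enum import Enum

MAX_TIMEOUT_RETRIES = 5  # "currently 5 times" per the rules above


class Action(Enum):
    STORE = "store"          # 1000: detail retrieved
    NO_RESULT = "no_result"  # 1101: nothing found, no retry
    RETRY = "retry"          # 9101: timeout with retries left
    REJECT = "reject"        # 8000: bad parameters, retrying cannot help
    FAIL = "fail"            # anything else, or retries exhausted


def next_action(task_result: int, retries_done: int = 0) -> Action:
    """Map a task_result code to the follow-up action for the task."""
    if task_result == 1000:
        return Action.STORE
    if task_result == 1101:
        return Action.NO_RESULT
    if task_result == 9101:
        if retries_done < MAX_TIMEOUT_RETRIES:
            return Action.RETRY
        return Action.FAIL
    if task_result == 8000:
        return Action.REJECT
    return Action.FAIL
```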

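Spider monitoring metrics design

(Under observation for now; to be filled in)
Index: 
Monitoring frequency: 
Monitoring start and end time: 
Alert condition: 
Alert group: 
Alert content: 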

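Data aggregation

Owner

Data aggregation method

  • Spider writes directly to kafka

  • Spider writes files, which logstash collects

Spider result directory

Directory after aggregation

logstash configuration file name

logstash file collection type

Data aggregation topic

ES log index and filter conditions

Monitoring dashboard

Data retention policy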


Data cleaning

Owner

Code location

Deployment address

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data source

Data storage table

  • Database address:
  • Table name: