Last edited by fanzx May 27, 2021

hub

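Basic information

Business registration: license display service (亮照服务), Hubei
ic_icpsp_req/hub searches by the credit_no given in the submitted task parameters. If credit_no is empty, it first obtains the credit_no from the license display service using company_code or company_name, then runs the query with it.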
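
The fallback described above can be sketched roughly as follows. This is an illustration only, not the spider's actual code: `resolve_credit_no` and the `lookup` callable are hypothetical names, and trying company_code before company_name is an assumption (the page only says one or the other is used).

```python
from typing import Callable, Mapping, Optional

# Hypothetical callable: takes a company_code or company_name and returns the
# credit_no found through the license display service, or None.
Lookup = Callable[[str], Optional[str]]


def resolve_credit_no(task: Mapping[str, str], lookup: Lookup) -> Optional[str]:
    """Return the credit_no to search with, looking it up when it is empty."""
    credit_no = (task.get("credit_no") or "").strip()
    if credit_no:
        return credit_no
    # Assumption: try company_code first, then company_name.
    for field in ("company_code", "company_name"):
        keyword = (task.get(field) or "").strip()
        if keyword:
            found = lookup(keyword)
            if found:
                return found
    return None
```

Passing the lookup in as a parameter keeps the sketch independent of how the service is actually called.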

Data name (Chinese)

湖北--工商.亮照服务

Data name (English)

ic_icpsp

Source website (crawl entry point)

Official site, desktop entry point:
http://wsdj.egs.gov.cn/ICPSP/queryEnt.action
Output path for crawled files:
/data/gravel_spiders/ic_icpsp_spider

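Crawl frequency and strategy

Update strategy for existing records

db_host: bdp-rds-007.mysql.rds.aliyuncs.com
db_name: utn_ic
db_user: shuidi
db_password: 
Database table: tb_search_company_icpsp
All registered entities in the province are used as search criteria.
Records are updated one at a time.
For now, one full update pass is sufficient.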

Incremental crawl strategy

1. Newly established entities
2. Entities added later as supplements

Spider

Hubei business registration license display service: ic_icpsp_spiders

Owner

郭本江

Spider name

ic_icpsp_spider

Code location

Project URL: http://192.168.109.110/granite/project-gravel/-/tree/develop_ic_icpsp/scrapy_spiders/gravel_spiders/spiders/ic_icpsp_reqs/hub

Queue name and address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • ic_icpsp_keys

Priority queue notes

  • ic_icpsp_keys supports queue priorities

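Task source

taskhub: the full set of registered entities in Hubei

Task input parameters (sample). The keys credit_no, province, company_name, and company_code are required; credit_no may be an empty string, as in the sample below.

CAPTCHA-solving version (破码版本)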
{
   "company_name": "武汉鸿雁房地产代理有限公司",
   "company_code":  "4201002111621",
   "credit_no":  "",
   "province":  "HUB",
   "data_detail_url":  "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=E8D47B3680F92E8CE588BF508DCB88BE",
   "company_major_type":  "3",
   "company_status":  "已吊销",
   "legal_person":  "刘桂强",
   "company_address":  "武昌区紫阳路147号",
   "create_time":  "2016-11-09 07:00:45.194000+00:00",
   "lastupdatetime":  "2021-03-28 14:57:26+00:00",
   "submit_time":  "",
   "authority":  "武汉市工商行政管理局",
   "n_company_status":  "吊销",
   "company_name_digest":  "3c7e408bcaedd3a960f0948ad81cbb74",
   "district_code":  ""
}
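
A check of the required keys before a task is submitted could look like the sketch below. The function names are hypothetical. The rule that at least one of credit_no, company_code, or company_name must be non-empty is an inference from the lookup behaviour described under basic information, not something this page states.

```python
from typing import Any, List, Mapping

# Keys the page lists as required in the task input.
REQUIRED_KEYS = ("credit_no", "province", "company_name", "company_code")


def missing_required_keys(task: Mapping[str, Any]) -> List[str]:
    """Return the required keys absent from the task (empty values are allowed)."""
    return [key for key in REQUIRED_KEYS if key not in task]


def has_search_key(task: Mapping[str, Any]) -> bool:
    """Assumption: a task is searchable only if one identifying field is non-empty."""
    fields = ("credit_no", "company_code", "company_name")
    return any(str(task.get(field) or "").strip() for field in fields)
```

With the sample above, `missing_required_keys` returns an empty list even though credit_no is an empty string, because only the presence of the key is checked.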

Task sample


{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91420115MA49RG6T9L",
				"province": "HUB",
				"company_name": "武汉中肯商贸有限公司",
				"credit_no": "91420115MA49RG6T9L",
				"legal_person": "张耘",
				"establish_date": "2021-05-17",
				"detail_url": "/ICPSP/pdf.action?id=160000020925690093&lzcx=lzcx",
				"data_id": "160000020925690093",
				"company_address": "湖北省武汉市江夏区郑店街黄金南路8号汽车制造专用模具、夹具、检具生产线4号厂房",
				"company_type_code": "1152",
				"company_type": "有限责任公司(自然人投资或控股的法人独资)",
				"capital": 50,
				"company_code": "",
				"company_status": "",
				"legal_person_certno": "F3A10E8FD2D7882064EC67721F9DEBE9"
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-05-24 18:04:27.775",
	"spider_end_time": "2021-05-24 18:04:31",
	"task_params": {
		"province": "HUB",
		"credit_no": "91420115MA49RG6T9L",
		"company_code": "",
		"company_name": "武汉中肯商贸有限公司"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.0.26"
}

Task parameter description

CAPTCHA-solving version (破码版本)
{
   "company_name": "武汉鸿雁房地产代理有限公司",
   "company_code":  "4201002111621",
   "credit_no":  "",
   "province":  "HUB",
   "data_detail_url":  "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=E8D47B3680F92E8CE588BF508DCB88BE",
   "company_major_type":  "3",
   "company_status":  "已吊销",
   "legal_person":  "刘桂强",
   "company_address":  "武昌区紫阳路147号",
   "create_time":  "2016-11-09 07:00:45.194000+00:00",
   "lastupdatetime":  "2021-03-28 14:57:26+00:00",
   "submit_time":  "",
   "authority":  "武汉市工商行政管理局",
   "n_company_status":  "吊销",
   "company_name_digest":  "3c7e408bcaedd3a960f0948ad81cbb74",
   "district_code":  ""
}

data_type description

detail: detail record

Spider result metadata (field descriptions)

{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91420115MA49RG6T9L",  -- search keyword
				"province": "HUB",                       -- province; fixed to HUB here
				"company_name": "武汉中肯商贸有限公司",    -- company name
				"credit_no": "91420115MA49RG6T9L",       -- unified social credit code
				"legal_person": "张耘",                   -- legal representative name
				"establish_date": "2021-05-17",           -- date of establishment
				"detail_url": "/ICPSP/pdf.action?id=160000020925690093&lzcx=lzcx",  -- detail URL
				"data_id": "160000020925690093",          -- the id parameter of the detail URL
				"company_address": "湖北省武汉市江夏区郑店街黄金南路8号汽车制造专用模具、夹具、检具生产线4号厂房",  -- registered address
				"company_type_code": "1152",              -- company type code
				"company_type": "有限责任公司(自然人投资或控股的法人独资)",  -- company type
				"capital": 50,                            -- registered capital
				"company_code": "",                       -- registration number
				"company_status": "",                     -- registration status
				"legal_person_certno": "F3A10E8FD2D7882064EC67721F9DEBE9"  -- encrypted legal representative ID
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,        -- task result status: 1000 = success, 1101 = query returned no result, any other value = crawl failed
	"data_type": "detail",
	"spider_start_time": "2021-05-24 18:04:27.775",
	"spider_end_time": "2021-05-24 18:04:31",
	"task_params": {
		"province": "HUB",
		"credit_no": "91420115MA49RG6T9L",
		"company_code": "",
		"company_name": "武汉中肯商贸有限公司"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.0.26"
}
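
Since data_id is described as the id parameter of detail_url, it can be recovered from the URL with the standard library. The helper below is a hypothetical sketch (the name `data_id_from_detail_url` is not from the project), shown only to make the relationship between the two fields concrete.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def data_id_from_detail_url(detail_url: str) -> Optional[str]:
    """Return the `id` query parameter of a detail_url, or None if it is absent."""
    query = urlsplit(detail_url).query
    values = parse_qs(query).get("id")
    return values[0] if values else None
```

For the annotated sample, `data_id_from_detail_url("/ICPSP/pdf.action?id=160000020925690093&lzcx=lzcx")` yields the same string as the data_id field. Callers should handle `None`, since a result item may carry no detail_url at all.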

Actual spider result data structure

{
	"data": {
		"icpsp_items": [
			{
				"search_keyword": "91422822MA490LYE01",
				"province": "HUB",
				"company_address": "建始县业州镇团结路新马路",
				"company_name": "湖北文图档案服务有限公司",
				"company_type_code": "1151",
				"company_type": "有限责任公司(自然人独资)",
				"establish_date": "2017-07-17",
				"capital": 80,
				"company_code": "",
				"credit_no": "91422822MA490LYE01",
				"legal_person": "侯贤才",
				"company_status": "",
				"legal_person_certno": "5B031B5292CCAF1EE26CAD8B09820982"
			}
		]
	},
	"http_code": 200,
	"error_msg": "",
	"task_result": 1000,
	"data_type": "detail",
	"spider_start_time": "2021-05-25 09:24:57.669",
	"spider_end_time": "2021-05-25 09:25:00",
	"task_params": {
		"company_name": "湖北文图档案服务有限公司",
		"company_code": "422822000035850",
		"credit_no": "91422822MA490LYE01",
		"province": "HUB",
		"data_detail_url": "http://xyjg.egs.gov.cn/ECPS_HB/company/detail.jspx?id=4F4A1397BB7B4B77A08DD1F6F5A1CE94",
		"company_major_type": "3",
		"company_status": "吊销,未注销",
		"legal_person": "侯贤才",
		"company_address": "建始县业州镇团结路新马路",
		"create_time": "2017-11-10 01:18:18+00:00",
		"lastupdatetime": "2021-03-23 00:44:22+00:00",
		"submit_time": "",
		"authority": "建始县市场监督管理局",
		"n_company_status": "吊销",
		"company_name_digest": "c1cda85ab5600d17b4b9f28784f41406",
		"district_code": "422822"
	},
	"metadata": {},
	"spider_name": "ic_icpsp_spider",
	"spider_ip": "10.8.6.51"
}

Spider runtime environment

scrapy

Spider deployment

ic_icpsp_spiders: 10.8.6.51, 35 processes

Taskhub addresses

Task submission URL: http://10.8.6.222:18518/task/
Code location: http://192.168.109.110/granite/project-gravel/blob/develop_app_10jqka_20210121/app_general_taxpayer/data_pump/general_taxpayer.yml

Taskhub scheduling rules

task_result=1000    # detail retrieved successfully
task_result=1101    # no result found
task_result=9101    # timeout error; the task is retried, currently up to 5 times
task_result=8000    # parameter error
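
The rules above can be read as a small decision function. The sketch below is one interpretation, not Taskhub's actual scheduler: `next_action` is a hypothetical name, and treating 1101 as terminal and 8000 as non-retryable are assumptions. The page only states that timeouts are retried 5 times.

```python
# Codes taken from the scheduling rules listed on this page.
SUCCESS, NO_RESULT, PARAM_ERROR, TIMEOUT = 1000, 1101, 8000, 9101

MAX_RETRIES = 5  # the page says timeouts are currently retried 5 times


def next_action(task_result: int, attempts: int) -> str:
    """Map a task_result code to 'done', 'retry', or 'failed'.

    `attempts` is the number of retries already made for this task.
    """
    if task_result in (SUCCESS, NO_RESULT):
        return "done"  # assumption: "no result" is a final, valid outcome
    if task_result == TIMEOUT and attempts < MAX_RETRIES:
        return "retry"
    # Parameter errors, exhausted timeouts, and unknown codes are not retried here.
    return "failed"
```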

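Spider monitoring metrics design

(Under observation; to be filled in.)
Index: 
Monitoring frequency: 
Monitoring start and end time: 
Alert condition: 
Alert group:  
Alert content: 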

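Data collection

Owner

范召贤

Collection method

  • The spider writes directly to Kafka

  • The spider writes files that Logstash collects

Spider result directory

/data/gravel_spiders/ic_icpsp_spider

Directory after collection

/data2_227/grvael_spider_result/ic_icpsp_spider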

Logstash configuration files

project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the Kafka topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael_spider_to_es.conf (writes to ES)

Logstash file input type

type=>"ic_icpsp_spider"

Collection topic

topic_id => "general-taxpayer"

ES log index and filter

index => "public-company-spider-data-%{log_date}"
{
  "query": {
    "match": {
      "spider_name.keyword": {
        "query": "ic_icpsp_spider",
        "type": "phrase"
      }
    }
  }
}

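Monitoring dashboard

Data retention policy


Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data source

Storage table location

  • Database address:
  • Table name: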