Last edited by yuanbo Feb 21, 2022
This is an old version of this page.

xiaowei

Basic information

Field descriptions

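from scrapy import Field, Item


class XiaoWeiItem(Item):
    company_name = Field()    # company name
    credit_no = Field()       # unified social credit code
    company_type = Field()    # company type, e.g. 有限责任公司(自然人投资或控股的法人独资)
    establish_date = Field()  # date of establishment
    capital = Field()         # registered capital
    authority = Field()       # registration authority
    industry_name = Field()   # industry category (门类)
    industry = Field()        # industry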

Status code notes

1000: data was retrieved normally
1101: no matching data was found
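
The two codes above can be captured as named constants so that downstream consumers avoid magic numbers. The following is a minimal sketch; the names `TaskResult`, `FOUND`, `NOT_FOUND`, and `is_success` are hypothetical and do not come from the project code.

```python
from enum import IntEnum


class TaskResult(IntEnum):
    """Status codes documented for the xiaowei spider (names are hypothetical)."""

    FOUND = 1000      # data was retrieved normally
    NOT_FOUND = 1101  # no matching data was found


def is_success(task_result: int) -> bool:
    """Return True only for the documented success code."""
    return task_result == TaskResult.FOUND
```

Because `IntEnum` members compare equal to plain integers, the raw `task_result` value from a result record can be passed in directly.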

Data name (Chinese)

小微企业 (small and micro enterprises)

Data name (English)

xiaowei

Source website (crawl entry point)

http://xwqy.gsxt.gov.cn/

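Crawl frequency and strategy

Update strategy for existing data

Companies currently in operation are fetched from ES. For the detailed update requirements, see the requirements document in this project's doc directory.

Incremental crawl strategy


Spider

xiaowei

Owner

袁波

Spider name

xiaowei

Code location

Project URL: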
http://tech.pingansec.com/granite/project-gravel/-/tree/xiaowei_20211228

## Queue name and queue address
<!--redis host port db key priority notes-->
* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key:
    * xiaowei:10
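
A connection URL of this shape can be split into its parts with the Python standard library. This is a sketch, not the project's connection code: the function name `parse_redis_url` and the example URL are placeholders. It also assumes that reserved characters in the password, such as `@`, are percent-encoded as `%40`, which is what RFC 3986 calls for even though some parsers tolerate a raw `@`.

```python
from urllib.parse import unquote, urlsplit


def parse_redis_url(url: str) -> dict:
    """Split a redis:// URL into host, port, db, and password."""
    parts = urlsplit(url)
    if parts.scheme != "redis":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # The path component carries the database number, e.g. "/7".
    db = int(parts.path.lstrip("/") or 0)
    password = unquote(parts.password) if parts.password else None
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,  # 6379 is the default Redis port
        "db": db,
        "password": password,
    }


# Placeholder URL; "%40" decodes to "@" in the password.
config = parse_redis_url("redis://:pa%40ss@redis.example.invalid:6379/7")
```

The resulting dictionary separates the host, port, and db values that the list above gives individually.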

Task input parameters (sample)

{
	"credit_no": "91530302397409612J",
	"company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
	"company_name": "曲靖市麒麟区乐友家庭农场",
	"company_code": "530302200120172",
	"_id": "0055c8d645205a8f1a46f7d2b94e5f76"
}
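
Before a task like the sample above is queued, it can be checked and serialized. The sketch below is only an illustration: the set of required keys is inferred from the sample alone, and `build_task` is a hypothetical helper that is not part of the project.

```python
import json

# Keys taken from the sample task above; whether all are mandatory is an assumption.
REQUIRED_KEYS = ("credit_no", "company_name_digest", "company_name", "company_code", "_id")


def build_task(params: dict) -> str:
    """Validate task parameters and serialize them to a JSON string."""
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"missing task parameters: {missing}")
    # ensure_ascii=False keeps Chinese company names readable in the payload.
    return json.dumps({key: params[key] for key in REQUIRED_KEYS}, ensure_ascii=False)


task = build_task({
    "credit_no": "91530302397409612J",
    "company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
    "company_name": "曲靖市麒麟区乐友家庭农场",
    "company_code": "530302200120172",
    "_id": "0055c8d645205a8f1a46f7d2b94e5f76",
})
```

How the serialized task is pushed to the Redis key is not described on this page, so the sketch stops at producing the JSON string.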

data_type notes

Data structure of actual spider results

task_result=1000: one result retrieved normally

{
	"task_result": 1000,
	"data_type": "detail",
	"data": {
		"company_name": "曲靖市麒麟区乐友家庭农场",
		"credit_no": "91530302397409612J",
		"company_type": "个人独资企业",
		"establish_date": "2015年08月17日",
		"capital": "30万人民币",
		"authority": "麒麟 ",
		"industry_name": "农、林、牧、渔业",
		"industry": "蔬菜种植"
	},
	"http_code": 200,
	"error_msg": "",
	"metadata": {},
	"spider_start_time": "2022-01-04 15:11:23",
	"spider_end_time": "2022-01-04 15:11:40",
	"task_params": {
		"credit_no": "91530302397409612J",
		"company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
		"company_name": "曲靖市麒麟区乐友家庭农场",
		"company_code": "530302200120172",
		"_id": "0055c8d645205a8f1a46f7d2b94e5f76"
	},
	"spider_name": "xiaowei",
	"spider_ip": "10.8.6.4"
}
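
Some fields in the result above arrive as display strings: `establish_date` uses the `YYYY年MM月DD日` form and `authority` carries trailing whitespace. A downstream consumer might normalize them as in this sketch. The helper names are hypothetical, and the formats handled are only those visible in the sample; other records may use different formats.

```python
from datetime import date, datetime
from typing import Optional


def parse_establish_date(text: str) -> Optional[date]:
    """Parse a date such as '2015年08月17日'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%Y年%m月%d日").date()
    except ValueError:
        return None


def clean_authority(text: str) -> str:
    """Strip surrounding whitespace from the registration authority."""
    return text.strip()
```

Returning `None` for unparseable dates, rather than raising, lets a consumer keep the record and flag the field for review.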

task_result=1101 with no recommended results: data is null, which corresponds to None in Python

{
	"task_result": 1101,
	"data": null,
	"task_params": {
		"credit_no": "91440101MA9Y2Q9034",
		"company_name_digest": "08fa9d49f957ea2981eb222383eb3e02",
		"company_name": "广州弘旭投资合伙企业(有限合伙)",
		"company_code": "440106008209371",
		"_id": "08fa9d49f957ea2981eb222383eb3e02"
	},
	"data_type": "detail",
	"http_code": 200,
	"error_msg": "",
	"spider_name": "xiaowei",
	"spider_ip": "10.8.6.4"
}

task_result=1101 with recommended data: the search found no matching result, but recommended data was returned

{
	"task_result": 1101,
	"data_type": "detail",
	"data": {
		"company_name": "曲靖市麒麟区乐友家庭农场",
		"credit_no": "91530302397409612J",
		"company_type": "个人独资企业",
		"establish_date": "2015年08月17日",
		"capital": "30万人民币",
		"authority": "麒麟 ",
		"industry_name": "农、林、牧、渔业",
		"industry": "蔬菜种植"
	},
	"http_code": 200,
	"error_msg": "",
	"metadata": {},
	"spider_start_time": "2022-01-04 15:11:23",
	"spider_end_time": "2022-01-04 15:11:40",
	"task_params": {
		"credit_no": "91530302397409612J",
		"company_name_digest": "0055c8d645205a8f1a46f7d2b94e5f76",
		"company_name": "曲靖市麒麟区乐友家庭农场",
		"company_code": "530302200120172",
		"_id": "0055c8d645205a8f1a46f7d2b94e5f76"
	},
	"spider_name": "xiaowei",
	"spider_ip": "10.8.6.4"
}
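
The three result shapes above differ only in `task_result` and in whether `data` is populated. A consumer could tell them apart as in the following sketch; the function name and the labels `matched`, `recommended`, and `not_found` are hypothetical and chosen here for illustration.

```python
def classify_result(record: dict) -> str:
    """Map a spider result record to one of the three documented cases."""
    code = record.get("task_result")
    data = record.get("data")
    if code == 1000:
        return "matched"
    if code == 1101:
        # data is None when no recommendation came back, a dict otherwise.
        return "recommended" if data else "not_found"
    raise ValueError(f"undocumented task_result: {code!r}")
```

Raising on any other code keeps undocumented statuses from being silently treated as one of the known cases.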

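Spider runtime environment

scrapy

Spider deployment information

Machine for the crontab job (collie user): to be added
Spider deployment machine: 10.8.6.4, 40 processes

Taskhub address

Not yet added

Taskhub scheduling rules

Spider monitoring metrics design

To be completed

Directory of spider results awaiting collection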

/data/gravel_spiders/xiaowei

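Data collection

Owner

范召贤

Data collection method

  • Spider writes directly to Kafka

  • Spider writes files, which Logstash collects

Spider result directory

/data/gravel_spiders/xiaowei

Storage directory after collection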

/data2_227/grvael_spider_result/xiaowei

Logstash configuration file names

project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the Kafka topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael/grvael_spider_to_es.conf (writes to ES)

Logstash file collection type

type=>"xiaowei"

Topic for collected data

topic_id => "general-taxpayer"

ES log index and filter conditions

index => "gravel-spider-data-%{log_date}"

Monitoring dashboard

Data retention policy


Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data source

Collected files

Data storage table location

  • Database address:
  • Database name:
  • Table name: