Last edited by 蒋家升 May 06, 2022

A_taxpayer

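Basic information

A-level taxpayers
A_taxpayer, deployed via Scrapy
Project name: project-gravel
Branch: develop_general_taxpayer

Data name (Chinese)

A级纳税人 (A-level taxpayers)

Data name (English)

A_taxpayer

Collection site (entry point)

POST http://hd.chinatax.gov.cn/service/findCredit.do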

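Collection frequency and strategy

Existing-data update strategy

Run one full round every month. Each round already covers both discovering new records and fully updating existing ones.

Incremental collection strategy

Iterate over every province code and page_num. The page count for each province comes from the response (in the sample result on this page, totalPages is 17073 for province 330000, which matches the task source list).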

Spider

A-level taxpayers (A_taxpayer)

Owner

杨龙斌

Spider name

A_taxpayer

Code location

Project URL: http://tech.pingansec.com/granite/project-gravel/-/blob/develop_general_taxpayer/scrapy_spiders/gravel_spiders/spiders/A_taxpayer_spider.py

Queue name and queue address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • A_taxpayer

Priority queue notes

  • A_taxpayer does not support queue priority

Task source

province_page_num = [
    ['320000', 21116], ['330000', 17073], ['440000', 16448], ['310000', 15221], ['370000', 12314],
    ['440300', 9878], ['110000', 9279], ['130000', 8472], ['510000', 7033], ['410000', 6574],
    ['420000', 5455], ['350000', 4723], ['330200', 4464], ['430000', 4147], ['610000', 4006],
    ['120000', 3852], ['340000', 3628], ['210000', 3209], ['370200', 3174], ['450000', 2675],
    ['140000', 2319], ['360000', 2205], ['530000', 2184], ['650000', 2154], ['350200', 1615],
    ['210200', 1605], ['620000', 1563], ['150000', 1518], ['520000', 1371], ['230000', 1088],
    ['220000', 993], ['640000', 860], ['500000', 763], ['460000', 643], ['630000', 379],
    ['540000', 94]
]
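
The list above pairs each province code with a page count, and the collection strategy says tasks are produced by iterating over both. A minimal sketch of that expansion follows. The function name `build_tasks` and the `first_page` parameter are hypothetical, and starting at page 1 is an assumption based only on the task sample on this page; the real spider may number pages differently.

```python
import json
from typing import Iterator, Sequence


def build_tasks(province_pages: Sequence[Sequence], first_page: int = 1) -> Iterator[str]:
    """Yield one JSON task string per (province_code, page_num) pair.

    province_pages uses the same shape as province_page_num:
    [[province_code, total_pages], ...].
    first_page is an assumption (the task sample uses page_num 1).
    """
    for province_code, total_pages in province_pages:
        for page_num in range(first_page, first_page + total_pages):
            yield json.dumps({"province_code": province_code, "page_num": page_num})


if __name__ == "__main__":
    # Smallest entry from the list above: province 540000 with 94 pages.
    for task in build_tasks([["540000", 94]]):
        print(task)
```

Each yielded string has the same two fields as the task sample, so it could be pushed to the A_taxpayer queue key as-is; the push itself is left out because the queue client used by the project is not described here.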

Task input parameters (sample)

{
  "province_code": "540000",
  "page_num": 1
}

Task sample

{"province_code":"540000", "page_num":1}

Task parameter description

{
  "province_code": "province code",
  "page_num": "page number"
}

data_type description

detail: detail information

Full spider result record (sample)

{
  "data": {
    "content": [
      {
        "code": "913306027044810927",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 15985975,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市万通纺织有限公司"
      },
      {
        "code": "91330602704481543H",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16010777,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市红日装饰广告有限公司"
      },
      {
        "code": "91330602704481703Q",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16010779,
        "level": "A类",
        "location": "330000",
        "name": "中国人民财产保险股份有限公司绍兴市分公司"
      },
      {
        "code": "91330602704481738B",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018053,
        "level": "A类",
        "location": "330000",
        "name": "绍兴正泰电器销售有限公司"
      },
      {
        "code": "913306027044824231",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 15985977,
        "level": "A类",
        "location": "330000",
        "name": "绍兴宝兴化工有限公司"
      },
      {
        "code": "913306027044829681",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 15959911,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市华众空压机有限公司"
      },
      {
        "code": "91330602704482976U",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 15985979,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市繁特利轴承有限公司"
      },
      {
        "code": "91330602704482992H",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16010781,
        "level": "A类",
        "location": "330000",
        "name": "浙江诚锦信息技术有限公司"
      },
      {
        "code": "91330602704483012D",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018055,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市越城雄丰物资销售部"
      },
      {
        "code": "913306027044830988",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018315,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市越城海荣机电物资供应站"
      },
      {
        "code": "9133060270448332X3",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018915,
        "level": "A类",
        "location": "330000",
        "name": "绍兴菲尼克进出口有限公司"
      },
      {
        "code": "91330602704483426H",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16010783,
        "level": "A类",
        "location": "330000",
        "name": "绍兴中国轻纺城贸易发展有限公司"
      },
      {
        "code": "91330602704483581X",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018917,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市越城铸造有限公司"
      },
      {
        "code": "9133060270448375XW",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 16018321,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市城区三星汽车配件有限公司"
      },
      {
        "code": "913306027044838646",
        "createTime": 1619758800000,
        "delFlag": "0",
        "evalyear": "2020",
        "id": 15959913,
        "level": "A类",
        "location": "330000",
        "name": "绍兴市上工机电设备有限公司"
      }
    ],
    "empty": false,
    "facets": [],
    "first": false,
    "last": false,
    "maxScore": null,
    "number": 10993,
    "numberOfElements": 15,
    "pageable": {
      "offset": 164895,
      "pageNumber": 10993,
      "pageSize": 15,
      "paged": true,
      "sort": {
        "empty": true,
        "sorted": false,
        "unsorted": true
      },
      "unpaged": false
    },
    "size": 15,
    "sort": {
      "$ref": "$.pageable.sort"
    },
    "totalElements": 256082,
    "totalPages": 17073
  },
  "http_code": 200,
  "error_msg": "",
  "task_result": 1000,
  "data_type": "detail",
  "spider_start_time": "2021-10-20 11:53:09.266",
  "spider_end_time": "2021-10-20 11:53:10",
  "task_params": {
    "province_code": "330000",
    "page_num": 10993,
    "warmhole_sessionid": "507327c8-a2e0-4976-a481-13bfc1f3165b"
  },
  "metadata": {},
  "spider_name": "A_taxpayer",
  "spider_ip": "10.8.6.51"
}
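
The sample record carries both the taxpayer rows and the paging metadata from which page counts are derived. The sketch below shows one way to read such a record: it checks that the paging fields agree with each other and flattens the rows. The function names are hypothetical, and treating `createTime` as epoch milliseconds is an assumption inferred from the magnitude of the sample values.

```python
import math
from datetime import datetime, timezone
from typing import Iterator


def summarize_page(record: dict) -> dict:
    """Collect paging facts from one result record and sanity-check them."""
    page = record["data"]
    pageable = page["pageable"]
    expected_pages = math.ceil(page["totalElements"] / pageable["pageSize"])
    return {
        "province_code": record["task_params"]["province_code"],
        "page_num": record["task_params"]["page_num"],
        "total_pages": page["totalPages"],
        # totalPages should equal ceil(totalElements / pageSize).
        "pages_consistent": expected_pages == page["totalPages"],
        # offset should equal pageNumber * pageSize.
        "offset_consistent": pageable["offset"] == pageable["pageNumber"] * pageable["pageSize"],
    }


def extract_rows(record: dict) -> Iterator[dict]:
    """Yield one flat row per taxpayer in data.content."""
    for item in record["data"]["content"]:
        yield {
            "code": item["code"],
            "name": item["name"],
            "evalyear": item["evalyear"],
            "level": item["level"],
            "location": item["location"],
            # Assumption: createTime is epoch milliseconds.
            "created_at": datetime.fromtimestamp(
                item["createTime"] / 1000, tz=timezone.utc
            ).isoformat(),
        }
```

For the sample above, both consistency checks hold: 256082 elements at 15 per page give 17073 pages, and page 10993 at 15 per page gives offset 164895.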

Actual spider result data structure

Identical to the sample record in the preceding section.

Spider runtime environment

scrapy

Spider deployment information

general_taxpayer: 10.8.6.51, 1 process

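Taskhub address

Taskhub scheduling rules

task_result=1000    # detail task fetched successfully
task_result=1101    # no result information
task_result=9101    # timeout error; the task must be retried, currently up to 5 times
task_result=8000    # parameter error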
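
The codes above imply a small decision rule: store on success, stop on empty or invalid tasks, and retry timeouts up to a limit. The sketch below is only an illustration of that rule. The function name `next_action` and the action labels are hypothetical and are not Taskhub's actual interface; the retry limit of 5 is taken from the note on code 9101.

```python
MAX_RETRIES = 5  # from the 9101 rule: timeouts are currently retried 5 times


def next_action(task_result: int, retries_done: int = 0) -> str:
    """Map a task_result code to a follow-up action (labels are illustrative)."""
    if task_result == 1000:
        return "store"      # detail fetched successfully
    if task_result == 1101:
        return "skip"       # no result information
    if task_result == 9101:
        # Timeout: retry until the limit is reached, then stop.
        return "retry" if retries_done < MAX_RETRIES else "give_up"
    if task_result == 8000:
        return "reject"     # parameter error; retrying the same task will not help
    return "unknown"        # code not listed in the scheduling rules


if __name__ == "__main__":
    for code in (1000, 1101, 9101, 8000):
        print(code, next_action(code))
```

Keeping the retry count as an explicit argument means the caller decides where that counter is stored, for example in the task's metadata.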

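Spider monitoring metrics design

(Observe first; to be completed)
Index: 
Monitoring frequency: 
Monitoring start and end time: 
Alert condition: 
Alert group: 
Alert content: 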

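Data aggregation

Owner

范召贤

Data aggregation method

  • Spider writes directly to Kafka

  • Spider writes to files, which Logstash collects

Spider result directory

/data/gravel_spiders/A_taxpayer

Storage directory after aggregation

/data2_227/grvael_spider_result/A_taxpayer

Logstash configuration file name

Logstash file collection type

type=>"a_taxpayer"

Data aggregation topic

topic_id => "general-taxpayer"

ES log index and filter conditions

public-company-spider-data-*

Monitoring dashboard

Data retention policy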


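Data cleaning

Owner

Code location

Deployment address

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data input source

Data storage table location

  • Database address:
  • Table name: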