
Last edited by 蒋家升 Jan 17, 2022

icp

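Basic information

ICP filing (备案) crawler
It consists of two crawlers:
    - Discovery (finds new records)
    - Routine (refreshes existing records)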

Data name (Chinese)

icp备案爬虫

Data name (English)

icp

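Source website (collection entry point)

Official site:
https://beian.miit.gov.cn/

Storage location of collected data:
    - Results: crawler results are written directly to Kafka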

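Collection frequency and strategy

Update strategy for existing records

A full refresh of all existing records is planned once a week.

Incremental collection strategy

Scheduled by the discovery crawler's internal logic:
    - Subject ID increment logic
    - Per-province subject filing number increment logic
All records published on the official site each day are collected that same day.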
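
The two increment rules above can be pictured with a minimal sketch. It assumes that candidate subject IDs are probed in ascending order and that a subject filing number is the province abbreviation followed by `ICP备`, a number, and `号`, as in the sample results on this page. The function names and starting values are hypothetical and are not taken from the icp_new code; zero-padded older numbers are not handled.

```python
from itertools import count, islice
from typing import Iterator


def next_subject_ids(last_seen: int) -> Iterator[int]:
    """Yield candidate subject IDs strictly after the last one already crawled."""
    return count(last_seen + 1)


def format_zt_baxh(province: str, number: int) -> str:
    """Build a subject filing number, e.g. '粤ICP备2021153599号' (format assumed from samples)."""
    return f"{province}ICP备{number}号"


def next_filing_numbers(province: str, last_seen: int) -> Iterator[str]:
    """Yield candidate filing numbers for one province in ascending order."""
    return (format_zt_baxh(province, n) for n in count(last_seen + 1))


# Hypothetical usage: take a small batch of candidates for each rule.
id_batch = list(islice(next_subject_ids(990000770576), 3))
filing_batch = list(islice(next_filing_numbers("粤", 2021153599), 2))
```

Both generators are unbounded, so a real scheduler would need its own stop condition, for example a run of consecutive candidates that return no result.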

Crawler

ICP filing crawler (icp)

Owner

蒋家升

Crawler names

Discovery: icp_new
Routine: icp_baxh

Code location

Project URLs:
    - Discovery crawler: http://192.168.109.110/lucioYao/aicha-spider/-/tree/master/icp
    - Routine crawler: http://tech.pingansec.com/granite/project-collie-app/-/tree/master/app_icp/udms

Queue name and address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • collie_icp_baxh

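Priority queue notes

  • icp supports queue priorities

Task sources

Discovery crawler

  • Subject ID increment logic
  • Per-province subject filing number increment logic

Routine crawler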

mysql:
    host: bdp-rds-003.mysql.rds.aliyuncs.com
    port: 3306
    db: utn_data
    table: tb_icp_baxh_info;tb_icp_base_info

Task input parameters (sample)

Routine crawler

{
    "id": 4897410,
    "company_name_digest": "1ecb0b4e31517d2e59ea7ad2e7d646bd",
    "dwmc": "烟台博升环保科技有限公司",
    "dwxz": "企业",
    "zt_baxh": "鲁ICP备13028318号",
    "ym_list": ",ytbstech.com"
}
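
A consumer of this task might normalize `ym_list` before use, since the sample value starts with a comma. The following is a minimal sketch, not the icp_baxh implementation: `parse_task` and the added `domains` key are hypothetical names, and treating a null `ym_list` as an empty list is an assumption.

```python
import json
from typing import Any, Dict


def parse_task(raw: str) -> Dict[str, Any]:
    """Decode one routine task and split ym_list into a clean list of domains."""
    task = json.loads(raw)
    # ym_list is comma-separated and may start with a comma or be null,
    # so empty fragments are dropped.
    parts = (task.get("ym_list") or "").split(",")
    task["domains"] = [p.strip() for p in parts if p.strip()]
    return task


sample = (
    '{"id": 4897410, "company_name_digest": "1ecb0b4e31517d2e59ea7ad2e7d646bd", '
    '"dwmc": "烟台博升环保科技有限公司", "dwxz": "企业", '
    '"zt_baxh": "鲁ICP备13028318号", "ym_list": ",ytbstech.com"}'
)
task = parse_task(sample)
```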

Task sample

Routine crawler

SELECT baxh.id, baxh.company_name_digest, baxh.dwmc, baxh.dwxz, baxh.zt_baxh,
       GROUP_CONCAT(base.ym SEPARATOR ',') AS ym_list
FROM tb_icp_baxh_info baxh
LEFT JOIN tb_icp_base_info base ON baxh.zt_baxh = base.zt_baxh
WHERE baxh.LAST_UPDATE_STATUS = 1
  AND baxh.LAST_UPDATE_TIME < DATE_SUB(NOW(), INTERVAL 21 DAY)
GROUP BY baxh.id;
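
The selection condition in this query can be restated in Python for unit-testing a scheduler outside the database. This is only a sketch of the same predicate; `is_due` and `STALE_AFTER` are hypothetical names.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=21)  # mirrors the 21-day interval in the query


def is_due(last_update_status: int, last_update_time: datetime, now: datetime) -> bool:
    """Return True when a record matches the WHERE clause of the task query."""
    return last_update_status == 1 and last_update_time < now - STALE_AFTER


now = datetime(2021, 11, 24, 15, 12, 10)
due = is_due(1, datetime(2021, 11, 1, 0, 0, 0), now)       # about 23 days old
not_due = is_due(1, datetime(2021, 11, 10, 0, 0, 0), now)  # about 14 days old
```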

Task parameter description

data_type description

There is currently no data_type.

Crawler result metadata

Same as the actual crawler results below.

Data structure of actual crawler results

Discovery crawler

{
  "search_province": "粤",
  "result_code": 1,
  "result_msg": "查找icp成功",
  "search_name": "ztid",
  "search_value": 990000770576,
  "item_dataes":
  [
    {
      "dwmc": "广州市花都区花城瑞江贸易商行",
      "dwxz": "企业",
      "wz_baxh": "粤ICP备2021153599号-1",
      "wzmc": "爱六八",
      "site_url": "www.ai6ba.com",
      "shsj": "2021-11-12 11:53:20",
      "domain": "ai6ba.com",
      "website_owner": "",
      "ztid": 990000770576,
      "zt_baxh": "粤ICP备2021153599号",
      "wzid": 990001367507,
      "ymid": 990001366936
    }
  ],
  "get_time": "2021-11-12 11:53:27",
  "ztid": 990000770576,
  "zt_baxh": 990000770576,
  "wzmc": "",
  "ym": "",
  "dwmc": ""
}
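
In the samples on this page, a site filing number (`wz_baxh`) looks like the subject filing number (`zt_baxh`) followed by `-` and a sequence number. Assuming that relationship holds in general, which the source does not state, a record could be sanity-checked as sketched below; `split_wz_baxh` and `item_is_consistent` are hypothetical helpers.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Assumed from the samples: wz_baxh = zt_baxh + "-" + site sequence number.
_WZ_BAXH = re.compile(r"^(?P<zt_baxh>.+号)-(?P<seq>\d+)$")


def split_wz_baxh(wz_baxh: str) -> Optional[Tuple[str, int]]:
    """Split a site filing number into (subject filing number, sequence)."""
    match = _WZ_BAXH.match(wz_baxh)
    if match is None:
        return None
    return match.group("zt_baxh"), int(match.group("seq"))


def item_is_consistent(item: Dict[str, Any]) -> bool:
    """Check that wz_baxh parses and its prefix equals the item's zt_baxh."""
    parsed = split_wz_baxh(item.get("wz_baxh", ""))
    return parsed is not None and parsed[0] == item.get("zt_baxh")


sample_item = {"wz_baxh": "粤ICP备2021153599号-1", "zt_baxh": "粤ICP备2021153599号"}
ok = item_is_consistent(sample_item)
```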

Routine crawler

{
  "last_update_status": 1,
  "task_params":
  {
    "id": "4328358",
    "company_name_digest": "106b8f7f988d07cdcf07da7e25246d02",
    "dwmc": "楚胜汽车集团有限公司",
    "dwxz": "企业",
    "zt_baxh": "鄂ICP备13004305号",
    "ym_list": "xgcsgs.net"
  },
  "searchkey": "鄂ICP备13004305号",
  "item_datas":
  [
    {
      "dwmc": "楚胜汽车集团有限公司",
      "ztid": 10000600349,
      "dwxz": "企业",
      "zt_baxh": "鄂ICP备13004305号",
      "wz_baxh": "鄂ICP备13004305号-1016",
      "wzid": 990000257356,
      "wzmc": "湖北楚胜汽车有限公司",
      "wzfzr": "",
      "site_url": "www.zycfxx.com",
      "ym": "zycfxx.com",
      "ymid": 990000257218,
      "shsj": "2021-07-29 09:34:35",
      "nrlx": "",
      "xzjr": "否",
      "oper": "A"
    },
    {
      "dwmc": "楚胜汽车集团有限公司",
      "ztid": 10000600349,
      "dwxz": "企业",
      "zt_baxh": "鄂ICP备13004305号",
      "wz_baxh": "鄂ICP备13004305号-1044",
      "wzid": 990000534990,
      "wzmc": "楚胜汽车集团有限公司",
      "wzfzr": "",
      "site_url": "www.csygc.cn",
      "ym": "csygc.cn",
      "ymid": 990000534665,
      "shsj": "2021-07-29 09:34:36",
      "nrlx": "",
      "xzjr": "否",
      "oper": "A"
    },
    {
      "zt_baxh": "鄂ICP备13004305号",
      "ym": "xgcsgs.net",
      "oper": "D"
    }, ...
  ],
  "search_province": "all",
  "result_msg": "success",
  "last_update_time": "2021-11-24 15:12:10",
  "timecost": 14.668023824691772,
  "id": "4328358",
  "zt_baxh": "鄂ICP备13004305号",
  "ym": "",
  "dwxz": "企业",
  "dwmc": "楚胜汽车集团有限公司",
  "last_update_total": 175,
  "is_valid": 1,
  "sync_condition":
  {
    "operation": "upsert",
    "data_type": "icp_baxh"
  }
}
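
A downstream consumer might group the returned records by their `oper` flag. The sketch below assumes, without confirmation from the source, that `A` marks a domain currently present on the filing and `D` marks a previously known domain that is no longer found. It also accounts for the two list keys seen in the samples (`item_dataes` for discovery, `item_datas` for routine). The function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def get_items(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the record list under whichever key the crawler used."""
    return result.get("item_datas") or result.get("item_dataes") or []


def domains_by_oper(result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group domains by oper flag; records without a flag go under ''."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in get_items(result):
        # Routine records carry the domain in 'ym', discovery records in 'domain'.
        domain = item.get("ym") or item.get("domain", "")
        groups[item.get("oper", "")].append(domain)
    return dict(groups)


routine = {
    "item_datas": [
        {"ym": "zycfxx.com", "oper": "A"},
        {"ym": "csygc.cn", "oper": "A"},
        {"ym": "xgcsgs.net", "oper": "D"},
    ]
}
grouped = domains_by_oper(routine)
```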

Crawler runtime environment

udm

Crawler deployment information

Discovery crawler

target: 10.8.10.63~76
spider_name: icp_new

Routine crawler

target: 10.8.6.39; 10.8.6.46
spider_name: icp_baxh

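Taskhub address

Taskhub is not configured.

Taskhub scheduling rules

task_result=1000    # details fetched successfully
task_result=1101    # no result found
task_result=9101    # timeout error; the task is retried, currently up to 5 times
task_result=8000    # invalid parameters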
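
The codes above could be encoded as follows. This is a sketch under stated assumptions: the enum member names and `should_retry` are hypothetical, and only the code values and the 5-retry limit for 9101 come from this page.

```python
from enum import IntEnum


class TaskResult(IntEnum):
    OK = 1000          # details fetched successfully
    NO_RESULT = 1101   # no result found
    TIMEOUT = 9101     # timeout error, retryable
    BAD_PARAMS = 8000  # invalid parameters


MAX_RETRIES = 5  # retry limit stated for code 9101


def should_retry(task_result: int, retries_done: int) -> bool:
    """Retry only timeouts, and only while the retry budget is not exhausted."""
    return task_result == TaskResult.TIMEOUT and retries_done < MAX_RETRIES
```

For example, `should_retry(9101, 0)` is true, while `should_retry(9101, 5)` and `should_retry(8000, 0)` are false.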

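Crawler monitoring metrics design

(Under observation; to be completed)
Index: 
Monitoring frequency: 
Monitoring start and end time: 
Alert condition: 
Alert group: 
Alert content: 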

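Data aggregation

Owner

Data aggregation method

  • The crawler writes directly to Kafka

  • The crawler writes files, which Logstash collects

Crawler result directory

Path of collected files: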
/data/gravel_spiders/icp

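Directory after aggregation

Logstash configuration file name

Logstash file collection type

Data aggregation topic

ES log index and filter conditions

Monitoring dashboard

Data retention policy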


Data cleaning

Owner

Code location

Deployment address

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data source

Data storage table location

  • Database address:
  • Table name: