Last edited by 蒋家升 Jan 17, 2022
This is an old version of this page. You can view the most recent version or browse the history.

icp

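Basic information

ICP filing crawler
It consists of two crawlers:
    - Discovery (finds new filings)
    - Routine (refreshes existing records)

Data name (Chinese)

icp备案爬虫

Data name (English)

icp

Source website (collection entry point)

Official site entry point: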
https://beian.miit.gov.cn/

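Storage path for collected data:
    - Results: crawler results are written directly to Kafka

Collection frequency and strategy

Existing-record update strategy

The plan is to complete one full update pass per week.

Incremental collection strategy

Scheduled by the discovery crawler's internal logic:
    - Incrementing entity IDs
    - Incrementing entity filing numbers within each province
Each day, the crawler collects everything published on the official site that day.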
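
The ID-increment idea behind the discovery crawler can be sketched as follows. This is only an illustration under assumptions: `candidate_ztids`, `last_seen`, and `batch_size` are hypothetical names, and the real scheduler may pick its starting point and batch size differently.

```python
from itertools import count, islice


def candidate_ztids(last_seen, batch_size):
    """Return the next `batch_size` entity IDs after `last_seen` (hypothetical helper)."""
    return list(islice(count(last_seen + 1), batch_size))


# Starting from the ztid shown in the discovery result sample on this page.
print(candidate_ztids(990000770576, 3))
```

Each candidate ID would then become one lookup task; IDs that return no result are simply skipped.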

Crawler

ICP filing crawler (icp)

Owner

蒋家升

Crawler names

Discovery: icp_new
Routine: icp_baxh

Code location

Project URLs:
    - Discovery crawler: http://192.168.109.110/lucioYao/aicha-spider/-/tree/master/icp
    - Routine crawler: http://tech.pingansec.com/granite/project-collie-app/-/tree/master/app_icp/udms

Queue name and address

  • redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
  • redis port: 6379
  • redis db: 7
  • redis key:
    • collie_icp_baxh

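Priority queue notes

  • icp supports queue priorities

Task sources

Discovery crawler

  • Incrementing entity IDs
  • Incrementing entity filing numbers within each province

Routine crawler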

mysql:
    host: bdp-rds-003.mysql.rds.aliyuncs.com
    port: 3306
    db: utn_data
    table: tb_icp_baxh_info;tb_icp_base_info

Task input parameters (sample)

{
    "id": 4897410, 
    "company_name_digest": "1ecb0b4e31517d2e59ea7ad2e7d646bd", 
    "dwmc": "烟台博升环保科技有限公司", "dwxz": "企业", "zt_baxh": "鲁ICP备13028318号", 
    "ym_list": ",ytbstech.com"
}
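
The sample `ym_list` value begins with a comma, so a naive split yields an empty first element. A minimal sketch of how a consumer might normalize the field, assuming domains are comma-separated and empty entries carry no meaning; `parse_ym_list` is a hypothetical helper, not part of the crawler code:

```python
import json


def parse_ym_list(ym_list):
    """Split a comma-separated ym_list into a list of non-empty domains."""
    if not ym_list:
        return []
    return [part.strip() for part in ym_list.split(",") if part.strip()]


# Subset of the task sample shown on this page.
task = json.loads('{"id": 4897410, "ym_list": ",ytbstech.com"}')
domains = parse_ym_list(task["ym_list"])
print(domains)
```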

Task sample

Routine crawler

select baxh.id, baxh.company_name_digest, baxh.dwmc, baxh.dwxz, baxh.zt_baxh, GROUP_CONCAT(base.ym SEPARATOR ',') ym_list
    from tb_icp_baxh_info baxh LEFT JOIN tb_icp_base_info base 
    on baxh.zt_baxh = base.zt_baxh 
    where baxh.LAST_UPDATE_STATUS=1 AND baxh.LAST_UPDATE_TIME<DATE_ADD(now(),INTERVAL - 21 DAY)
    GROUP BY baxh.id;

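Task parameter notes

data_type notes

There is currently no data_type.

Crawler result super data

Same as the actual crawler results below.

Data structure of actual crawler results

Discovery crawler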

{
  "search_province": "粤",
  "result_code": 1,
  "result_msg": "查找icp成功",
  "search_name": "ztid",
  "search_value": 990000770576,
  "item_dataes":
  [
    {
      "dwmc": "广州市花都区花城瑞江贸易商行",
      "dwxz": "企业",
      "wz_baxh": "粤ICP备2021153599号-1",
      "wzmc": "爱六八",
      "site_url": "www.ai6ba.com",
      "shsj": "2021-11-12 11:53:20",
      "domain": "ai6ba.com",
      "website_owner": "",
      "ztid": 990000770576,
      "zt_baxh": "粤ICP备2021153599号",
      "wzid": 990001367507,
      "ymid": 990001366936
    }
  ],
  "get_time": "2021-11-12 11:53:27",
  "ztid": 990000770576,
  "zt_baxh": 990000770576,
  "wzmc": "",
  "ym": "",
  "dwmc": ""
}
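
In the samples on this page, a site filing number (`wz_baxh`) is the entity filing number (`zt_baxh`) followed by `-` and a sequence number. A minimal sketch of splitting the two, assuming that pattern holds for all records; `split_wz_baxh` is a hypothetical helper:

```python
import re

# Entity filing number ending in "号", then "-<sequence>".
WZ_BAXH_RE = re.compile(r"^(?P<zt_baxh>.+号)-(?P<seq>\d+)$")


def split_wz_baxh(wz_baxh):
    """Return (zt_baxh, sequence); sequence is None when there is no suffix."""
    match = WZ_BAXH_RE.match(wz_baxh)
    if match is None:
        return wz_baxh, None
    return match.group("zt_baxh"), int(match.group("seq"))


print(split_wz_baxh("粤ICP备2021153599号-1"))
```

A check like this could also flag records whose item-level `zt_baxh` does not match the prefix of `wz_baxh`.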

Routine crawler

{
  "last_update_status": 1,
  "task_params":
  {
    "id": "4328358",
    "company_name_digest": "106b8f7f988d07cdcf07da7e25246d02",
    "dwmc": "楚胜汽车集团有限公司",
    "dwxz": "企业",
    "zt_baxh": "鄂ICP备13004305号",
    "ym_list": "xgcsgs.net"
  },
  "searchkey": "鄂ICP备13004305号",
  "item_datas":
  [
    {
      "dwmc": "楚胜汽车集团有限公司",
      "ztid": 10000600349,
      "dwxz": "企业",
      "zt_baxh": "鄂ICP备13004305号",
      "wz_baxh": "鄂ICP备13004305号-1016",
      "wzid": 990000257356,
      "wzmc": "湖北楚胜汽车有限公司",
      "wzfzr": "",
      "site_url": "www.zycfxx.com",
      "ym": "zycfxx.com",
      "ymid": 990000257218,
      "shsj": "2021-07-29 09:34:35",
      "nrlx": "",
      "xzjr": "否",
      "oper": "A"
    },
    {
      "dwmc": "楚胜汽车集团有限公司",
      "ztid": 10000600349,
      "dwxz": "企业",
      "zt_baxh": "鄂ICP备13004305号",
      "wz_baxh": "鄂ICP备13004305号-1044",
      "wzid": 990000534990,
      "wzmc": "楚胜汽车集团有限公司",
      "wzfzr": "",
      "site_url": "www.csygc.cn",
      "ym": "csygc.cn",
      "ymid": 990000534665,
      "shsj": "2021-07-29 09:34:36",
      "nrlx": "",
      "xzjr": "否",
      "oper": "A"
    },
    {
      "zt_baxh": "鄂ICP备13004305号",
      "ym": "xgcsgs.net",
      "oper": "D"
    }, ...
  ],
  "search_province": "all",
  "result_msg": "success",
  "last_update_time": "2021-11-24 15:12:10",
  "timecost": 14.668023824691772,
  "id": "4328358",
  "zt_baxh": "鄂ICP备13004305号",
  "ym": "",
  "dwxz": "企业",
  "dwmc": "楚胜汽车集团有限公司",
  "last_update_total": 175,
  "is_valid": 1,
  "sync_condition":
  {
    "operation": "upsert",
    "data_type": "icp_baxh"
  }
}
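
Each entry in `item_datas` carries an `oper` flag. A minimal sketch of grouping entries by that flag, assuming `"A"` marks a domain currently listed under the filing and `"D"` marks a domain from the task's `ym_list` that is no longer found (this reading is inferred from the sample, not confirmed); `group_by_oper` is a hypothetical helper:

```python
from collections import defaultdict


def group_by_oper(item_datas):
    """Group result items by their `oper` flag."""
    groups = defaultdict(list)
    for item in item_datas:
        groups[item.get("oper", "")].append(item)
    return dict(groups)


# Trimmed-down version of the routine crawler sample on this page.
item_datas = [
    {"zt_baxh": "鄂ICP备13004305号", "ym": "zycfxx.com", "oper": "A"},
    {"zt_baxh": "鄂ICP备13004305号", "ym": "csygc.cn", "oper": "A"},
    {"zt_baxh": "鄂ICP备13004305号", "ym": "xgcsgs.net", "oper": "D"},
]

groups = group_by_oper(item_datas)
active_domains = [item["ym"] for item in groups.get("A", [])]
removed_domains = [item["ym"] for item in groups.get("D", [])]
print(active_domains, removed_domains)
```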

Crawler runtime environment

udm

Crawler deployment information

Discovery crawler

target: 10.8.10.63~76
spider_name: icp_new

Routine crawler

target: 10.8.6.39; 10.8.6.46
spider_name: icp_baxh

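Taskhub address

Taskhub is not configured.

Taskhub scheduling rules

task_result=1000    # detail task fetched successfully
task_result=1101    # no result information
task_result=9101    # timeout error; must be retried, currently up to 5 retries
task_result=8000    # parameter error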
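
The codes above could be turned into a scheduling decision roughly as follows. This is a sketch under assumptions: the action names are invented for illustration, and whether the limit of 5 counts the first attempt is not stated on this page (here `retries_done` counts retries already made).

```python
MAX_RETRIES = 5  # from the note on task_result=9101

# Action names are hypothetical; only the codes come from this page.
TASK_RESULT_ACTIONS = {
    1000: "done",        # detail fetched successfully
    1101: "done_empty",  # no result information
    9101: "retry",       # timeout error
    8000: "drop",        # parameter error, retrying would not help
}


def next_action(task_result, retries_done):
    """Decide what to do with a task given its result code and retry count."""
    action = TASK_RESULT_ACTIONS.get(task_result, "unknown")
    if action == "retry" and retries_done >= MAX_RETRIES:
        return "give_up"
    return action


print(next_action(9101, 0), next_action(9101, 5))
```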

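Crawler monitoring metrics design

(Under observation; to be filled in)
Index: 
Monitoring frequency: 
Monitoring start/end time: 
Alert condition: 
Alert group:  
Alert content: 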

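Data aggregation

Owner

Aggregation method

  • Crawler writes directly to Kafka

  • Crawler writes files, which Logstash collects

Crawler result directory

Storage path for collected files: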
/data/gravel_spiders/icp

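Storage directory after aggregation

Logstash configuration file name

Logstash file collection type

Aggregation topic

ES log index and filter conditions

Monitoring dashboard

Data retention policy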


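Data cleaning

Owner

Code location

Deployment address

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data input source

Data storage table location

  • Database address:
  • Table name: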