
Last edited by fanzx Nov 16, 2021
This is an old version of this page. You can view the most recent version or browse the history.

aiqicha

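Basic information

Data dimensions

1. Basic information
2. Shareholder information
3. Executive information

Data name (Chinese)

爱企查

Data name (English)

aiqicha_search_company

Source website (crawl entry point)

Search page: https://aiqicha.baidu.com/
Detail page: https://aiqicha.baidu.com/company_detail_31370200772422

Crawl frequency and strategy

Existing-data update strategy

This round first finishes crawling the 40 million+ companies currently in operation.

Incremental crawl strategy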


Crawler

爱企查   aiqicha_search_company

Owner

袁波

Crawler name

aiqicha_search_company

Code location

Project URL: http://192.168.109.110/granite/project-gravel/-/tree/aiqicha_20211112


## Queue name and queue address
<!-- redis host, port, db, key, and priority notes -->
* redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
* redis port: 6379
* redis db: 7
* redis key: 
    * aiqicha_search_company

Task input parameters (sample)

{
  "company_name": "眉山市彭山百盛农业发展有限公司",
  "credit_no": "91511422689947471C"
}
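
A minimal sketch of how a task payload shaped like the sample above could be built and checked before it is queued. The helper name `build_task` and the 18-character check on `credit_no` are assumptions for illustration and are not taken from the crawler code. This page does not describe how tasks are pushed to the `aiqicha_search_company` Redis key, so that step is left out.

```python
import json


def build_task(company_name: str, credit_no: str) -> str:
    """Serialize one task in the same shape as the sample above.

    Hypothetical helper: the real task producer is not shown on this page.
    """
    company_name = company_name.strip()
    credit_no = credit_no.strip().upper()
    if not company_name:
        raise ValueError("company_name must not be empty")
    # Assumption: credit_no is an 18-character alphanumeric code,
    # as in both samples on this page.
    if len(credit_no) != 18 or not credit_no.isalnum():
        raise ValueError(f"unexpected credit_no: {credit_no!r}")
    # ensure_ascii=False keeps the Chinese company name readable in the payload.
    return json.dumps(
        {"company_name": company_name, "credit_no": credit_no},
        ensure_ascii=False,
    )


task = build_task("眉山市彭山百盛农业发展有限公司", "91511422689947471C")
print(task)
```

Rejecting malformed input before queuing keeps obviously bad tasks from spending a crawl attempt; whether the real pipeline validates at this point is not stated here.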

Task parameter description

data_type description

Super data of the crawler result

Actual data structure of the crawler result

{
    "data": [{
        "baseinfo": {
            "legal_person": "王森",
            "legal_person_url": "/person?personId=5f075668eb4d758c04ddc362b5aabb6d&subtab=personal-allenterprises",
            "company_status": "开业",
            "capital": "1,000万(元)",
            "stock_realcapital": "-",
            "history_name": "-",
            "industries_name": "房地产业",
            "credit_no": "91310114MA1GYHF51D",
            "tax_code": "-",
            "company_code": "91310114MA1GYHF51D",
            "company_name": "上海皓辰祥和房地产开发有限公司",
            "org_code": "MA1GYHF5-1",
            "authority": "嘉定区市场监督管理局",
            "establish_date": "2021-07-29",
            "company_type": "有限责任公司(自然人投资或控股)",
            "operation_startdate": "2021-07-29",
            "operation_enddate": "2051-07-28",
            "issue_date": "2021-07-29",
            "company_address": "上海市嘉定区封周路655号14幢201室JT7994",
            "business_scope": "许可项目:房地产开发经营;建设工程设计;各类工程建设活动;建设工程监理;住宅室内装饰装修;工程造价咨询业务。(依法须经批准的项目,经相关部门批准后方可开展经营活动,具体经营项目以相关部门批准文件或许可证件为准) 一般项目:信息咨询服务(不含许可类信息咨询服务);物业管理;园林绿化工程施工;五金产品零售;建筑材料销售;金属材料销售;机械设备销售。(除依法须经批准的项目外,凭营业执照依法自主开展经营活动)",
            "claim_status": "我要认领"
        },
        "partners": [{
            "partner_no": 1,
            "stock_name": "上海辰景企业发展有限公司",
            "partner_url": "",
            "stock_proportion": "85%",
            "stock_capital": "850万(元)",
            "stock_realcapital": "-"
        },
        {
            "partner_no": 2,
            "stock_name": "苏州万昇行科技有限公司",
            "partner_url": "",
            "stock_proportion": "15%",
            "stock_capital": "150万(元)",
            "stock_realcapital": "-"
        }],
        "employees": [{
            "employee_no": 1,
            "employee_name": "王森",
            "employee_url": "/person?personId=5f075668eb4d758c04ddc362b5aabb6d",
            "position": "执行董事"
        },
        {
            "employee_no": 2,
            "employee_name": "张攀",
            "employee_url": "/person?personId=2e123b59ec0cfdc5b4611ce67478315a",
            "position": "监事"
        }]
    }],
    "http_code": 200,
    "error_msg": "",
    "task_result": 1000,
    "data_type": "detail",
    "spider_start_time": "2021-11-15 18:38:46.785",
    "spider_end_time": "2021-11-15 18:38:48.260",
    "task_params": {
        "company_name": "上海皓辰祥和房地产开发有限公司",
        "credit_no": "91310114MA1GYHF51D"
    },
    "metadata": {},
    "spider_name": "aiqicha_search_company",
    "spider_ip": "10.8.1.50",
    "proxy_ip": "http://10.8.6.219:1805"
}
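
A hedged sketch of how a downstream consumer might read a result record like the one above. The function names (`parse_percent`, `parse_capital_wan`, `summarize`) are hypothetical, and the parsing rules assume the value formats seen in this single sample ("85%", "1,000万(元)", "-" for missing values); other records may use formats not covered here.

```python
import re
from decimal import Decimal


def parse_percent(text: str):
    """Convert "85%" to Decimal("85"); return None for "-" or other placeholders."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%\s*", text)
    return Decimal(match.group(1)) if match else None


def parse_capital_wan(text: str):
    """Convert "1,000万(元)" to an amount in yuan; return None if it does not match.

    Assumption: amounts use the "<number>万" form seen in the sample.
    """
    match = re.match(r"\s*([\d,]+(?:\.\d+)?)\s*万", text)
    if not match:
        return None
    return Decimal(match.group(1).replace(",", "")) * 10000


def summarize(result: dict) -> dict:
    """Pull a few fields out of the first record in one crawler result."""
    record = result["data"][0]
    shares = [parse_percent(p["stock_proportion"]) for p in record["partners"]]
    return {
        "company_name": record["baseinfo"]["company_name"],
        "capital_yuan": parse_capital_wan(record["baseinfo"]["capital"]),
        "partner_count": len(record["partners"]),
        "total_share": sum((s for s in shares if s is not None), Decimal(0)),
        "employee_names": [e["employee_name"] for e in record["employees"]],
    }


# Trimmed copy of the sample result above, keeping only the fields used here.
sample = {
    "data": [{
        "baseinfo": {
            "company_name": "上海皓辰祥和房地产开发有限公司",
            "capital": "1,000万(元)",
            "stock_realcapital": "-",
        },
        "partners": [
            {"partner_no": 1, "stock_proportion": "85%"},
            {"partner_no": 2, "stock_proportion": "15%"},
        ],
        "employees": [
            {"employee_no": 1, "employee_name": "王森"},
            {"employee_no": 2, "employee_name": "张攀"},
        ],
    }],
    "task_result": 1000,
}

summary = summarize(sample)
print(summary)
```

`Decimal` is used instead of `float` so that capital amounts and shareholding percentages are not affected by binary rounding.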

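Crawler runtime environment

scrapy

Crawler deployment information

Machine running the crontab job (user collie): 10.8.6.63
Crawler deployment machine: 10.8.6.62, 20 processes

Taskhub address

Task submission URL: http://10.8.6.222:8526/inbound/public_company_spider_data/check_task/

Taskhub scheduling rules

task_result=9110    # unexpected status code
task_result=9201    # abnormal prompt message appeared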
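
The codes above can be kept in a small lookup so logs and alerts show a readable reason. This is only an illustrative sketch: the names `TASK_RESULT_MEANINGS`, `describe_task_result`, and `is_listed_failure` are hypothetical, and treating `1000` as success is an assumption based on the sample result earlier on this page, not a documented rule. How Taskhub reschedules each code is not specified here.

```python
# Codes listed on this page. 1000 appears in the sample result and is
# assumed (not confirmed) to mean success.
TASK_RESULT_MEANINGS = {
    1000: "success (assumed from the sample result)",
    9110: "unexpected status code",
    9201: "abnormal prompt message appeared",
}

# The two codes this page lists under the Taskhub scheduling rules.
LISTED_FAILURE_CODES = frozenset({9110, 9201})


def describe_task_result(code: int) -> str:
    """Return a readable description for a task_result code."""
    return TASK_RESULT_MEANINGS.get(code, f"unknown task_result {code}")


def is_listed_failure(code: int) -> bool:
    """True only for the failure codes documented on this page."""
    return code in LISTED_FAILURE_CODES


for code in (1000, 9110, 9201, 1234):
    print(code, describe_task_result(code), is_listed_failure(code))
```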

Crawler monitoring metrics design

TBD

Data collection

Owner

范召贤

Data collection method

  • Crawler writes directly to Kafka

  • Crawler writes files that Logstash collects

Crawler result directory

/data/gravel_spiders/aiqicha_search_company

Directory after collection

/data2_227/grvael_spider_result/aiqicha_search_company

Logstash configuration file names

project-deploy/logstash/10.8.6.246/conf.d/collie_spider_data_to_kfk.conf (writes to the topic)
project-deploy/logstash/10.8.6.229/conf.d/grvael_spider_to_es.conf (writes to ES)

Logstash file collection type

type=>"aiqicha-spider-data"

Topic for collected data

topic_id => "public-company-spider-data"

ES log index and filter conditions

index => "public-company-spider-data-%{log_date}"

Monitoring dashboard

Data retention policy


Data cleaning

Owner

Code location

Deployment location

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data source

Storage table location

  • Database address:
  • Table name: