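Basic information
Equity penetration QCC crawler
equity_penetration_qcc, deployed via scrapy
Project name: project-gravel
Branch: develop_equity_penetration
Data name (Chinese)
股权穿透QCC爬虫
Data name (English)
equity_penetration_qcc
Crawl site (entry point)
Official PC web entry: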
https://www.qcc.com
Crawl output file path:
/data/gravel_spiders/equity_penetration_qcc
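Crawl frequency and strategy
Full update strategy
Currently a single full-refresh round that traverses all regions and companies
Incremental crawl strategy
Crawler
Equity penetration QCC crawler equity_penetration_qcc
Owner
蒋家升
Crawler name
equity_penetration_qcc
Code location
Project URL: http://tech.pingansec.com/granite/project-gravel/-/tree/develop_equity_penetration
Queue name and queue address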
- redis host: redis://:utn@0818@bdp-mq-001.redis.rds.aliyuncs.com:6379/7
- redis port: 6379
- redis db: 7
- redis key:
- qcc
Priority queue notes
- equity_penetration supports queue priorities
Task source
Task input parameters (samples)
# Region list task
{"area_code": "AH_340100", "page": "1"}
# Search list task
{"search_key": "北京出国邦出入境服务有限公司"}
# Detail page task
{"fid": "0727d5d1a4f95d791ff4b7ce5d6e975a"}
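Task samples
Task parameter description
- area_code: province or city/district code, e.g. Anhui (AH); Hefei (AH_340100)
- page: page number
- search_key: text entered in the search box
- fid: QCC company ID
- pid: QCC person ID (appears as the task parameter in the detail_person result sample)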
data_type description
- list_region: region list
- list_search: search list
- detail_company: company detail page information
- detail_person: person detail page information
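The relationship between task parameters and data_type can be sketched as a small lookup. This is only an illustration, not the crawler's actual routing code: the function name `infer_data_type` is hypothetical, and the `search_key` to `list_search` pairing is an assumption, since this document contains no search-list result sample. The other three pairings follow the `task_params` and `data_type` values shown in the result samples.

```python
from typing import Optional

# Distinguishing task parameter -> data_type of the result.
# area_code, fid and pid pairings match the samples in this document;
# search_key -> list_search is an assumption (no sample is given).
PARAM_TO_DATA_TYPE = {
    "area_code": "list_region",
    "search_key": "list_search",
    "fid": "detail_company",
    "pid": "detail_person",
}


def infer_data_type(task: dict) -> Optional[str]:
    """Return the data_type a task is expected to produce, or None if unknown."""
    for param, data_type in PARAM_TO_DATA_TYPE.items():
        if task.get(param):
            return data_type
    return None
```

For example, `infer_data_type({"area_code": "AH_340100", "page": "1"})` yields `"list_region"`, and a task with none of the known parameters yields `None`.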
Crawler result metadata
Same as the actual crawler results below
Data structure of actual crawler results
- Region list task result
{
"data":
[
{
"fid": "13df1591b2302573e518c410acd7b2b4",
"qcc_url": "https://www.qcc.com/firm/13df1591b2302573e518c410acd7b2b4.html",
"company_name": "大渡口区玖贰辉荟服装经营部"
},
{
"fid": "b028024bb8010add7d668bed6e8b0079",
"qcc_url": "https://www.qcc.com/firm/b028024bb8010add7d668bed6e8b0079.html",
"company_name": "重庆心揽科技发展有限公司"
}
],
"http_code": 200,
"error_msg": "",
"task_result": 1000,
"data_type": "list_region",
"spider_start_time": "2021-11-24 22:41:29.584",
"spider_end_time": "2021-11-24 22:41:29",
"task_params": {"area_code": "CQ_500104","page": "5"},
"metadata": {"area_code": "CQ_500104","page": "5"},
"spider_name": "equity_penetration_qcc",
"spider_ip": "10.8.6.51"
}
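Each row of a region list result carries a `fid`, which is also the parameter of a company detail task. A minimal sketch of turning one list result into detail tasks follows. Whether the production pipeline chains tasks this way is an assumption; the helper names `firm_url` and `detail_tasks_from_list` are hypothetical, and the URL pattern is only the one observed in the `qcc_url` values of the sample.

```python
QCC_FIRM_URL = "https://www.qcc.com/firm/{fid}.html"  # pattern seen in the sample qcc_url values


def firm_url(fid: str) -> str:
    """Build the company page URL for a QCC company id."""
    return QCC_FIRM_URL.format(fid=fid)


def detail_tasks_from_list(result: dict) -> list:
    """Build one {"fid": ...} task per distinct company in a successful list result."""
    if result.get("task_result") != 1000:
        return []  # only successful results carry usable rows
    seen = set()
    tasks = []
    for row in result.get("data") or []:
        fid = row.get("fid")
        if fid and fid not in seen:
            seen.add(fid)
            tasks.append({"fid": fid})
    return tasks
```

Deduplicating by `fid` keeps repeated companies across pages from producing duplicate detail tasks within one result.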
- Company detail page result
{
"data":
{
"business_license":
{
"登记状态": "存续(在营、开业、在册)",
"成立日期": "2005-08-12",
"人员规模": "300-399人",
"曾用名": "-",
"进出口企业代码": "-",
"统一社会信用代码": "91310110779301025N",
"企业名称": "上海宽娱数码科技有限公司",
"注册资本": "50000万元人民币",
"实缴资本": "1680万元人民币",
"核准日期": "2021-11-16",
"组织机构代码": "77930102-5",
"工商注册号": "310110000371080",
"纳税人识别号": "91310110779301025N",
"企业类型": "有限责任公司(自然人独资)",
"营业期限": "2005-08-12至无固定期限",
"纳税人资质": "-",
"所属行业": "科技推广和应用服务业",
"所属地区": "上海市",
"登记机关": "杨浦区市场监督管理局",
"最新年报地址": "上海市杨浦区国定路335号2号楼1905室(2020年报)",
"经营范围": "许可项目:第一类增值电信业务;第二类增值电信业务;基础电信业务;出版物批发;出版物零售;餐饮服务;信息网络传播视听节目;网络文化经营;广播电视节目作经营;营业性演出。(依法须经批准的项目,经相关部门批准后方可开展经营活动,具体经营项目以相关部门批准文件或许可证件为准)一般项目:数码科技、计算机软硬件科技领域内的技术咨询、技术转让、技术开发、技术服务,广告发布(非广播电台、电视台、报刊出版单位),票务代理,计算机软硬件、日用百货、办公用品、工艺美术品(象牙及其品除外)、服装服饰、鞋帽、针纺织品、玩具、文化体育用品、家居用品、电子产品、通讯设备、宠物用品、化妆品、卫生洁具、家用电器、文化用品、皮革品、包装材料、珠宝首饰的销售。(除依法须经批准的项目外,凭营业执照依法自主开展经营活动)",
"法定代表人":
{
"legal_person": "陈睿",
"pid": "pdc2d22e33cabf11add23ddbc90fd62f"
},
"参保人数": "380",
"英文名": "ShanghaiKuanyuDigitalTechnologyCo.,Ltd.",
"注册地址": "上海市杨浦区政立路489号801室"
},
"main_members":
[
{
"职务": "执行董事,法定代表人",
"持股比例": "100%持股详情>",
"最终受益股份": "100%股权链>",
"姓名":
{
"member": "陈睿",
"pid": "pdc2d22e33cabf11add23ddbc90fd62f",
"tags": ["实际控制人","最终受益人","有股权出质","大股东"]
}
},
{
"职务": "监事",
"持股比例": "-",
"最终受益股份": "-",
"姓名":
{
"member": "李旎",
"pid": "p18ae8dbf5cfd395eb02eb536dd1e58a"
}
}
],
"shareholders":
[
{
"持股比例": "100%持股详情>",
"认缴出资额(万元)": "50000",
"认缴出资日期": "2041-08-20",
"参股日期": "2014-08-06",
"实缴出资额(万元)": "1680",
"实缴出资日期": "2009-10-19",
"股东及出资信息":
{
"shareholder": "陈睿",
"pid": "pdc2d22e33cabf11add23ddbc90fd62f",
"tags": ["大股东","实际控制人","最终受益人","有股权出质"]
}
}
],
"touzilist":
[
{
"注册资本": "1000万元人民币",
"成立日期": "2019-03-01",
"状态": "存续",
"持股比例": "100%",
"认缴出资额": "1000万元人民币",
"融资轮次": "-",
"融资日期": "-",
"关联产品/机构": "哔哩哔哩bilibili",
"被投资企业名称":
{
"invested_company": "海南红红火火信息科技有限公司",
"fid": "25ebe2f0466fffd9ce82df1705986658"
},
"法定代表人":
{
"legal_person": "郑彬炜",
"pid": "p71e12c94c13c44b208971209d3da792"
}
}
],
"company_pv": "18万+"
},
"http_code": 200,
"error_msg": "",
"task_result": 1000,
"data_type": "detail_company",
"spider_start_time": "2021-12-03 19:49:16.811",
"spider_end_time": "2021-12-03 19:49:26",
"task_params": {"fid": "78045ae17d1d9487163b97233b7477d2"},
"metadata": {"fid": "78045ae17d1d9487163b97233b7477d2"},
"spider_name": "equity_penetration_qcc",
"spider_ip": "10.8.1.30"
}
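Several fields in the company detail result are raw page strings, for example `"100%持股详情>"` for a shareholding ratio or `"50000万元人民币"` for registered capital. A downstream cleaning step might normalize them roughly as below. This is a sketch under assumptions: the helper names are hypothetical, `"-"` is treated as a missing value, and the capital parser only reads the number before `万` without interpreting the currency.

```python
import re
from typing import Optional

_PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")
_WAN_RE = re.compile(r"(\d+(?:\.\d+)?)\s*万")


def parse_percent(text: Optional[str]) -> Optional[float]:
    """Extract the leading percentage from strings such as '100%持股详情>'."""
    match = _PERCENT_RE.search(text or "")
    return float(match.group(1)) if match else None


def parse_capital_wan(text: Optional[str]) -> Optional[float]:
    """Extract the amount, in units of 万 (10,000), from strings such as '50000万元人民币'."""
    match = _WAN_RE.search(text or "")
    return float(match.group(1)) if match else None
```

Both helpers return `None` for `"-"` or empty input, so callers can distinguish a missing value from a zero.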
- Person detail page result
{
"data":
{
"legallist":
[
{
"KeyNo": "bf7a36cf53f8208141a5d9a2c68c3488",
"Name": "北京新东方大愚文化传播有限公司",
"OperName": "俞敏洪",
"OperPersonId": "p1b99d0e8a749a1a32c1e17c2d41d686",
"OperType": 1,
"RelatedCount": 83,
"RegCap": "2000万元人民币",
"ImageUrl": "https://image.qcc.com/logo/bf7a36cf53f8208141a5d9a2c68c3488.jpg?x-oss-process=style/logo_200",
"Date": 1053014400,
"Status": "存续",
"CoyCode": "",
"Relation":
[
{
"Type": "0",
"TypeDesc": "法定代表人",
"Value": "俞敏洪",
"StartDate": -1,
"EndDate": 0
},
{
"Type": "2",
"TypeDesc": "任职",
"Value": "总经理,执行董事",
"StartDate": -1,
"EndDate": 0
}
],
"Area":
{
"Province": "北京市",
"City": "北京市",
"County": "海淀区"
},
"Industry":
{
"IndustryCode": "R",
"Industry": "文化、体育和娱乐业",
"SubIndustryCode": "87",
"SubIndustry": "广播、电视、电影和录音制作业",
"MiddleCategoryCode": null,
"MiddleCategory": null,
"SmallCategoryCode": null,
"SmallCategory": null
},
"RegistCapiAmt": 2000,
"SXCount": 0,
"ZXCount": 0
}
],
"allcompanylist":
[
{
"KeyNo": "effab7edf99cd329486b6237266dd5cd",
"Name": "北京汇智博纳教育科技有限公司",
"OperName": "金利",
"OperPersonId": "p128d7ba6adfe5015ecfdabca188b802",
"OperType": 1,
"RelatedCount": 3,
"RegCap": "1000万元人民币",
"ImageUrl": "https://image.qcc.com/logo/effab7edf99cd329486b6237266dd5cd.jpg?x-oss-process=style/logo_200",
"Date": 1304611200,
"Status": "存续",
"CoyCode": "",
"Relation":
[
{
"Type": "1",
"TypeDesc": "股东",
"Value": "70.00%",
"StartDate": -1,
"EndDate": 0
},
{
"Type": "2",
"TypeDesc": "任职",
"Value": "监事",
"StartDate": -1,
"EndDate": 0
}
],
"Area":
{
"Province": "北京市",
"City": "北京市",
"County": "海淀区"
},
"Industry":
{
"IndustryCode": "M",
"Industry": "科学研究和技术服务业",
"SubIndustryCode": "75",
"SubIndustry": "科技推广和应用服务业",
"MiddleCategoryCode": "759",
"MiddleCategory": "其他科技推广服务业",
"SmallCategoryCode": "7590",
"SmallCategory": "其他科技推广服务业"
},
"RegistCapiAmt": 1000,
"SXCount": 0,
"ZXCount": 0
}
],
"investlist":
[
{
"KeyNo": "7a71aee12bf18701d3b1da8fa1a4bf5f",
"Name": "北京合力惠东投资中心(有限合伙)",
"OperName": "湖州恒益股权投资管理有限公司",
"OperPersonId": "39c826a638deececf9ac5f9097a1410c",
"OperType": 2,
"RelatedCount": 3,
"RegCap": "3236.319954万元人民币",
"ImageUrl": "https://image.qcc.com/auto/7a71aee12bf18701d3b1da8fa1a4bf5f.jpg?x-oss-process=style/logo_200",
"Date": 1328803200,
"Status": "存续",
"CoyCode": "",
"Relation":
[
{
"Type": "1",
"TypeDesc": "股东",
"Value": "15.45%",
"StartDate": 1355414400,
"EndDate": 0
}
],
"Area":
{
"Province": "北京市",
"City": "北京市",
"County": "海淀区"
},
"Industry":
{
"IndustryCode": "L",
"Industry": "租赁和商务服务业",
"SubIndustryCode": "72",
"SubIndustry": "商务服务业",
"MiddleCategoryCode": "721",
"MiddleCategory": "组织管理服务",
"SmallCategoryCode": "7212",
"SmallCategory": "投资与资产管理"
},
"RegistCapiAmt": 3236,
"SXCount": 0,
"ZXCount": 0
}
],
"postofficelist":
[
{
"KeyNo": "4bf81171baf9db38f0768c7e36cbe683",
"Name": "北京洪泰企业管理集团有限公司",
"OperName": "盛希泰",
"OperPersonId": "p84ef64a59e0ed69364c0e8732ea9c2d",
"OperType": 1,
"RelatedCount": 91,
"RegCap": "30000万元人民币",
"ImageUrl": "https://image.qcc.com/auto/4bf81171baf9db38f0768c7e36cbe683.jpg?x-oss-process=style/logo_200",
"Date": 1583769600,
"Status": "存续",
"CoyCode": "",
"Relation":
[
{
"Type": "1",
"TypeDesc": "股东",
"Value": "11.11%",
"StartDate": 1583769600,
"EndDate": 0
},
{
"Type": "2",
"TypeDesc": "任职",
"Value": "监事",
"StartDate": 1583769600,
"EndDate": 0
}
],
"Area":
{
"Province": "北京市",
"City": "北京市",
"County": "通州区"
},
"Industry":
{
"IndustryCode": "L",
"Industry": "租赁和商务服务业",
"SubIndustryCode": "72",
"SubIndustry": "商务服务业",
"MiddleCategoryCode": null,
"MiddleCategory": null,
"SmallCategoryCode": null,
"SmallCategory": null
},
"RegistCapiAmt": 30000,
"SXCount": 0,
"ZXCount": 0
}
]
},
"http_code": 200,
"error_msg": "",
"task_result": 1000,
"data_type": "detail_person",
"spider_start_time": "2021-12-03 19:10:56.001",
"spider_end_time": "2021-12-03 19:11:30",
"task_params": {"pid": "p1b99d0e8a749a1a32c1e17c2d41d686"},
"metadata": {"pid": "p1b99d0e8a749a1a32c1e17c2d41d686"},
"spider_name": "equity_penetration_qcc",
"spider_ip": "10.8.1.30"
}
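The `Date`, `StartDate` and `EndDate` fields in the person detail result are integers that look like Unix timestamps in seconds, with `-1` and `0` apparently used as placeholders. A hedged conversion sketch: it assumes the values are seconds since the epoch and should be read in China Standard Time (UTC+8), and that non-positive values mean "not available". Neither assumption is confirmed by this document, and `epoch_to_date` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: timestamps represent midnight in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def epoch_to_date(value) -> Optional[str]:
    """Convert epoch seconds to 'YYYY-MM-DD' in UTC+8; return None for -1, 0 or non-integers."""
    if not isinstance(value, int) or value <= 0:
        return None
    return datetime.fromtimestamp(value, tz=CST).strftime("%Y-%m-%d")
```

Under these assumptions, the `Date` value `1053014400` from the first `legallist` entry converts to `2003-05-16`.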
Crawler runtime environment
scrapy
Crawler deployment information
target: node_51
project: equity_penetration
spider_name: equity_penetration_qcc
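Taskhub address
Task submission URL:
Code authoring URL:
Taskhub scheduling rules
task_result=1000 # result fetched successfully
task_result=1101 # no result information
task_result=9101 # timeout error; must be retried, currently up to 5 times
task_result=8000 # parameter error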
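The scheduling rules above can be read as a small decision function. The sketch below is illustrative only and is not Taskhub's implementation: the action labels (`"done"`, `"done_empty"`, `"retry"`, `"give_up"`, `"reject"`, `"unknown"`) and the function name `next_action` are hypothetical, and treating any unlisted code as `"unknown"` is an assumption.

```python
MAX_RETRIES = 5  # "currently retried 5 times" per the rules above

SUCCESS = 1000
NO_RESULT = 1101
TIMEOUT = 9101
PARAM_ERROR = 8000


def next_action(task_result: int, retries_so_far: int = 0) -> str:
    """Map a task_result code to a (hypothetical) scheduling action."""
    if task_result == SUCCESS:
        return "done"
    if task_result == NO_RESULT:
        return "done_empty"  # finished, but the site returned nothing
    if task_result == TIMEOUT:
        # Retry until the retry budget is exhausted.
        return "retry" if retries_so_far < MAX_RETRIES else "give_up"
    if task_result == PARAM_ERROR:
        return "reject"  # retrying cannot fix a malformed task
    return "unknown"
```

Only the timeout code is retried here, since a parameter error or an empty result would not change on a second attempt.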
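Crawler monitoring metrics design
(Under observation; to be added)
Index:
Monitoring frequency:
Monitoring start and end time:
Alert conditions:
Alert group:
Alert content: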
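Data collection
Owner
Data collection method
- Crawler writes directly to kafka
- Crawler writes files; logstash collects them
Crawler result directory
Crawl output file path: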
/data/gravel_spiders/equity_penetration_qcc
Storage directory after collection
/data2_227/grvael_spider_result/equity_penetration_qcc
logstash config file name
logstash file collection type
equity_penetration_qcc
Topic for collected data
general-taxpayer
ES log index and filter conditions
gravel-spider-data-*
Monitoring dashboard
Data retention policy
Data cleaning
Owner
Code location
Deployment address
Deployment method and notes
- crontab + data_pump
- supervisor + data_pump
- supervisor + consumer
Data input source
Data storage table address
- Database address:
- Table name: