Basic Information
Data name (Chinese)
图片下载 (picture download)
Data name (English)
picture_download
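Source website (collection entry point)
Collection frequency and strategy
Determined by the specific task
Update strategy for existing data
Incremental collection strategy
Spider
Owner
董雨峰
Spider name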
picture_download
Code repository
Queue name and address
- redis host: bdp-mq-001.redis.rds.aliyuncs.com
- redis port: 6379
- redis db: 7
- redis key: download_picture_url
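The four settings above can be collapsed into a single connection URL. The sketch below only builds that URL with plain string formatting; `QUEUE_CONFIG` and `redis_url` are illustrative names and not part of the spider's code, and whether the instance needs a password is not stated in this document.

```python
# Illustrative only: QUEUE_CONFIG and redis_url are hypothetical names.
QUEUE_CONFIG = {
    "host": "bdp-mq-001.redis.rds.aliyuncs.com",
    "port": 6379,
    "db": 7,
    "key": "download_picture_url",  # queue key, not part of the URL
}


def redis_url(config: dict) -> str:
    """Build a URL of the form redis://host:port/db from the settings."""
    return f"redis://{config['host']}:{config['port']}/{config['db']}"


if __name__ == "__main__":
    print(redis_url(QUEUE_CONFIG))
```

A client library such as redis-py accepts a URL of this form through `redis.Redis.from_url`.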
Priority queue notes
Task source
- Submitted by the consumer by calling the taskhub API
Task input parameters (sample)
{
"spider_name":"picture_download",
"key": "",
"url": "",
"bucket": "patent",
"proxy": 1
}
Task sample
{
"key": "",
"url": "",
"bucket": "patent",
"proxy": 1
}
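Task parameter descriptions
"key": unique key of the picture
"url": URL of the picture
"bucket": source of the picture
"proxy": whether to use a proxy; 1 = use, 0 = do not use; defaults to 1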
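As a rough illustration of the four parameters, the sketch below assembles a task and wraps it into the task input form with `spider_name`. The function names are hypothetical, and the rule that `key`, `url` and `bucket` must be non-empty is an assumption made for the example, not something this document states.

```python
import json

ALLOWED_PROXY = (0, 1)  # 1 = use a proxy, 0 = do not use one


def build_task(key: str, url: str, bucket: str, proxy: int = 1) -> dict:
    """Assemble a task dict with the four documented parameters."""
    # Assumption: empty values are rejected; the document does not say so.
    if not key or not url or not bucket:
        raise ValueError("key, url and bucket must be non-empty")
    if proxy not in ALLOWED_PROXY:
        raise ValueError("proxy must be 0 or 1")
    return {"key": key, "url": url, "bucket": bucket, "proxy": proxy}


def build_task_input(task: dict, spider_name: str = "picture_download") -> str:
    """Add spider_name to a task and serialize the result as JSON."""
    return json.dumps({"spider_name": spider_name, **task}, ensure_ascii=False)


if __name__ == "__main__":
    task = build_task(
        "CN204671179U",
        "http://qxb-img.oss-cn-hangzhou.aliyuncs.com/dlpatents/009c730db9230747893f2356324376ba.jpg",
        "patent",
    )
    print(build_task_input(task))
```

Leaving `proxy` out of the call falls back to 1, which matches the documented default.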
data_type description
Superset of the spider result data
{
"spider_name": "picture_download",
"platform_name": "picture",
"http_code": 200,
"error_msg": "successful",
"task_result": 1000,
"data_type": "",
"bucket": "patent",
"spider_start_time": "2021-10-13 15:48:56",
"spider_end_time": "2021-10-13 15:49:08",
"spider_used_time_ms": 12,
"spider_ip": "10.8.6.30",
"task_params": {
"key": "CN204671179U",
"url": "http://qxb-img.oss-cn-hangzhou.aliyuncs.com/dlpatents/009c730db9230747893f2356324376ba.jpg",
"bucket": "patent"
},
"metadata": {},
"data": {
"key": "CN204671179U",
"bucket": "patent",
"store_path": "patent/ff/b1/d0/ffb1d035d18b1d8a37ad2ac54218adb9.jpg",
"content": "",
"basket_host": "10.8.8.59:31010"
}
}
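A consumer of this record will typically want to know whether the download succeeded and how long it took. The sketch below is one way to read those fields; treating `http_code` 200 together with `task_result` 1000 as success is an assumption taken from this single sample, and both function names are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # format used by spider_start_time / spider_end_time


def elapsed_seconds(result: dict) -> float:
    """Return the wall-clock time between the start and end timestamps."""
    start = datetime.strptime(result["spider_start_time"], TIME_FORMAT)
    end = datetime.strptime(result["spider_end_time"], TIME_FORMAT)
    return (end - start).total_seconds()


def looks_successful(result: dict) -> bool:
    """Assumption: 200 and 1000 are the success values, as in the sample."""
    return result.get("http_code") == 200 and result.get("task_result") == 1000


if __name__ == "__main__":
    sample = {
        "http_code": 200,
        "task_result": 1000,
        "spider_start_time": "2021-10-13 15:48:56",
        "spider_end_time": "2021-10-13 15:49:08",
    }
    print(looks_successful(sample), elapsed_seconds(sample))
```

For the sample, the two timestamps are 12 seconds apart while `spider_used_time_ms` is also 12, so the field may hold seconds despite its name; this is worth confirming against the spider code.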
Actual spider result data structure
{
"spider_name": "picture_download",
"platform_name": "picture",
"http_code": 200,
"error_msg": "successful",
"task_result": 1000,
"data_type": "",
"bucket": "patent",
"spider_start_time": "2021-10-13 15:48:56",
"spider_end_time": "2021-10-13 15:49:08",
"spider_used_time_ms": 12,
"spider_ip": "10.8.6.30",
"task_params": {
"key": "CN204671179U",
"url": "http://qxb-img.oss-cn-hangzhou.aliyuncs.com/dlpatents/009c730db9230747893f2356324376ba.jpg",
"bucket": "patent"
},
"metadata": {},
"data": {
"key": "CN204671179U",
"bucket": "patent",
"store_path": "patent/ff/b1/d0/ffb1d035d18b1d8a37ad2ac54218adb9.jpg",
"content": "",
"basket_host": "10.8.8.59:31010"
}
}
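In the sample, `store_path` places the file under the bucket name followed by three two-character directories that match the first six characters of the file name. The sketch below reproduces that layout; the rule is inferred from this one sample, `sharded_store_path` is a hypothetical name, and how the 32-character file name itself is derived is not stated in this document and is not assumed here.

```python
import posixpath


def sharded_store_path(bucket: str, file_name: str) -> str:
    """Place file_name under bucket/aa/bb/cc/, using its first six characters."""
    stem = file_name.split(".", 1)[0]
    if len(stem) < 6:
        raise ValueError("file name stem needs at least six characters")
    shards = [stem[0:2], stem[2:4], stem[4:6]]
    return posixpath.join(bucket, *shards, file_name)


if __name__ == "__main__":
    print(sharded_store_path("patent", "ffb1d035d18b1d8a37ad2ac54218adb9.jpg"))
```

Called with the sample's bucket and file name, this yields the same `store_path` as in the record above.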
Spider runtime environment
udm
Spider deployment information
Deployment machine: 10.8.6.30
Number of processes: 1
Project name: app_picture_download
Taskhub address
- 10.8.6.222
Taskhub scheduling rules
submit_task:
  class: http.HttpRequestWriter
  init:
    url: "http://10.8.6.222:8526/task/"
    method: post
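The rule above amounts to an HTTP POST against the task endpoint. A minimal sketch of the same request using only the Python standard library follows; sending the task as a JSON body with a `Content-Type: application/json` header is an assumption, since the expected body format is not specified here, and the function names are hypothetical.

```python
import json
import urllib.request

TASKHUB_URL = "http://10.8.6.222:8526/task/"  # url from the submit_task rule


def build_submit_request(task_input: dict, url: str = TASKHUB_URL) -> urllib.request.Request:
    """Prepare a POST request carrying the task input (assumed JSON body)."""
    body = json.dumps(task_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(task_input: dict, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_submit_request(task_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    demo = {"spider_name": "picture_download", "key": "", "url": "", "bucket": "patent", "proxy": 1}
    prepared = build_submit_request(demo)
    print(prepared.get_method(), prepared.full_url)
```

Building the request separately from sending it keeps the payload inspectable without touching the network.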
Spider monitoring metric design
Directory of spider results pending collection
Data Aggregation
Owner
Data aggregation method
- Spider writes directly to kafka
- Spider writes to files, which logstash collects
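For the file-based route, a common arrangement is to append one JSON object per line so that a log shipper can pick up each record independently. The sketch below shows that pattern only; the one-record-per-line format, the `result.log` file name and the `append_result` function are all assumptions, and the directory is left as a parameter because the pending-results directory is not filled in here.

```python
import json
import os


def append_result(result: dict, directory: str, file_name: str = "result.log") -> str:
    """Append one result as a single JSON line and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    with open(path, "a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps any non-ASCII text readable in the file
        handle.write(json.dumps(result, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        print(append_result({"spider_name": "picture_download"}, tmp))
```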
Storage directory after aggregation
/data2_227/grvael_spider_result/picture_download
logstash configuration file name
logstash file collection type
Aggregation topic
collie_picture_download
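ES log index and filter conditions
Monitoring dashboard
Data retention policy
Data Cleaning
Owner
Code repository
Deployment address
Deployment method and notes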
- crontab + data_pump
- supervisor + data_pump
- supervisor + consumer
Data input source
Data storage table address
- Database address:
- Table name: