... | ... | @@ -80,7 +80,8 @@ risk_court_notice |
|
|
## Code location
|
|
|
|
|
|
```buildoutcfg
|
|
|
Project URL: http://tech.pingansec.com/granite/project-judicature/-/tree/dev_court_notice
|
|
|
Project URL:
|
|
|
http://tech.pingansec.com/granite/project-judicature/-/tree/dev_court_notice
|
|
|
```
|
|
|
|
|
|
## Queue name and queue address
|
... | ... | @@ -99,20 +100,36 @@ risk_court_notice |
|
|
|
|
|
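<!-- Describe the input of the crawler task, e.g. a database table. If the input comes from a database table, briefly explain how the data in that table is maintained. -->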
|
|
|
|
|
|
```buildoutcfg
|
|
|
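For a crawler that has already been written, a line like the following is added to the file for the corresponding crawl frequency.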
|
|
|
It specifies the URL and whether to paginate.
|
|
|
{"url_split": ["http://hnzzy.chinacourt.gov.cn/article/index/id/MzQ1NTBINiAOAAA/page/", "####", ".shtml"], "start_index": 1, "pages": 5, "task_type": "开庭公告", "method": "GET", "this_page": 3}
|
|
|
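### For an existing crawler, a line like the following (URL and pagination settings) is added to the file for the corresponding crawl frequency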
|
|
|
```json
|
|
|
{
|
|
|
"url_split": ["http://hnzzy.chinacourt.gov.cn/article/index/id/MzQ1NTBINiAOAAA/page/", "####", ".shtml"],
|
|
|
"start_index": 1,
|
|
|
"pages": 5,
|
|
|
"task_type": "开庭公告",
|
|
|
"method": "GET",
|
|
|
"this_page": 3
|
|
|
}
|
|
|
```
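
The following is a minimal sketch of how such a task line might be expanded into page URLs. It assumes that `"####"` in `url_split` is a placeholder for the page number and that pages run from `start_index` for `pages` consecutive pages; the function name `expand_page_urls` is hypothetical and is not taken from the project code. The meaning of `this_page` is not documented here, so the sketch leaves it untouched.

```python
# Hypothetical helper: not part of judicature_spider, shown for illustration only.
def expand_page_urls(task):
    """Return one URL per page, substituting the page number for "####"."""
    start = task["start_index"]
    urls = []
    for page in range(start, start + task["pages"]):
        # Assumption: "####" marks where the page number goes.
        parts = [str(page) if part == "####" else part for part in task["url_split"]]
        urls.append("".join(parts))
    return urls


task = {
    "url_split": [
        "http://hnzzy.chinacourt.gov.cn/article/index/id/MzQ1NTBINiAOAAA/page/",
        "####",
        ".shtml",
    ],
    "start_index": 1,
    "pages": 5,
    "task_type": "开庭公告",
    "method": "GET",
    "this_page": 3,
}

page_urls = expand_page_urls(task)
for url in page_urls:
    print(url)
```

Keeping the URL as a three-part list avoids string formatting on URLs that may themselves contain `%` or `{}` characters, which is one plausible reason for the `url_split` layout.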
|
|
|
|
|
|
### Task example
|
|
|
|
|
|
<!-- Note: this is the complete task the crawler receives, not just task_params. -->
|
|
|
|
|
|
```buildoutcfg
|
|
|
{"url_split": ["http://hnzzy.chinacourt.gov.cn/article/index/id/MzQ1NTBINiAOAAA/page/", "####", ".shtml"], "start_index": 1, "pages": 5, "task_type": "开庭公告", "method": "GET", "this_page": 3}
|
|
|
```json
|
|
|
{
|
|
|
"url_split": ["http://hnzzy.chinacourt.gov.cn/article/index/id/MzQ1NTBINiAOAAA/page/", "####", ".shtml"],
|
|
|
"start_index": 1,
|
|
|
"pages": 5,
|
|
|
"task_type": "开庭公告",
|
|
|
"method": "GET",
|
|
|
"this_page": 3
|
|
|
}
|
|
|
```
|
|
|
|
|
|
### Task example specific to the People's Court Announcement Network (人民法院公告网) and the court trial livestream site (庭审直播网)
|
|
|
|
|
|
Task example specific to the People's Court Announcement Network (人民法院公告网) and the court trial livestream site (庭审直播网)
|
|
|
```json
|
|
|
{
|
|
|
"start_index": 1,
|
|
|
"pages": 1,
|
... | ... | @@ -142,19 +159,35 @@ risk_court_notice |
|
|
|
|
|
<!-- Describe the data_type values that may be produced. -->
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
The only data_type is detail
|
|
|
```
|
|
|
|
|
|
## Crawler result super data
|
|
|
|
|
|
<!-- JSON data containing every field; each value must have a sample value. -->
|
|
|
### There are many data sources; the result after the first unified parsing pass is as follows
|
|
|
|
|
|
```json
|
|
|
There are many data sources; the result after the first unified parsing pass is as follows
|
|
|
{
|
|
|
"data": {
|
|
|
"court_items": [{"source": "http://hdzy.hbsfgk.org", "data": "<div class=\"ywzw_con_inner\">\r\n\t\t\t\t\t<p class=\"p_source \">2021-08-06 来源: 魏县人民法院</p>\r\n\t\t\t\t\t<h3 class=\"h3_title\">魏县人民法院</h3>\r\n\t\t\t\t\t<p class=\"p_notice\">公 告</p>\r\n\t\t\t\t\t<p class=\"p_text\">我院定于2021年10月28日 09时30分在本院第二审判庭依法公开审理姜国平诉任乃平民间借贷纠纷一案。</p>\r\n\t\t\t\t\t<p class=\"p_tcgg\">特此公告</p>\r\n\t\t\t\t\t<p class=\"p_date\">二〇二一年八月六日</p>\r\n\t\t\t\t</div>\r\n\t\t\t", "refer": "http://hdzy.hbsfgk.org/ktggInfo.jspx?fyid=153&bh=1D53B4BA51209B45055BB3DA0BE58188&isapp=null", "task_type": "开庭公告"}]}, "http_code": 200, "error_msg": "", "task_result": 1000, "data_type": "detail", "spider_start_time": "2021-08-17 17:23:45.238", "spider_end_time": "2021-08-17 17:23:56", "task_params": {}, "metadata": {}, "spider_name": "risk_court_notice", "spider_ip": "192.168.56.1"
|
|
|
"data": {
|
|
|
"court_items": [{
|
|
|
"source": "http://hdzy.hbsfgk.org",
|
|
|
"data": "<div class=\"ywzw_con_inner\">\r\n\t\t\t\t\t<p class=\"p_source \">2021-08-06 来源: 魏县人民法院</p>\r\n\t\t\t\t\t<h3 class=\"h3_title\">魏县人民法院</h3>\r\n\t\t\t\t\t<p class=\"p_notice\">公 告</p>\r\n\t\t\t\t\t<p class=\"p_text\">我院定于2021年10月28日 09时30分在本院第二审判庭依法公开审理姜国平诉任乃平民间借贷纠纷一案。</p>\r\n\t\t\t\t\t<p class=\"p_tcgg\">特此公告</p>\r\n\t\t\t\t\t<p class=\"p_date\">二〇二一年八月六日</p>\r\n\t\t\t\t</div>\r\n\t\t\t",
|
|
|
"refer": "http://hdzy.hbsfgk.org/ktggInfo.jspx?fyid=153&bh=1D53B4BA51209B45055BB3DA0BE58188&isapp=null",
|
|
|
"task_type": "开庭公告"
|
|
|
}]
|
|
|
},
|
|
|
"http_code": 200,
|
|
|
"error_msg": "",
|
|
|
"task_result": 1000,
|
|
|
"data_type": "detail",
|
|
|
"spider_start_time": "2021-08-17 17:23:45.238",
|
|
|
"spider_end_time": "2021-08-17 17:23:56",
|
|
|
"task_params": {},
|
|
|
"metadata": {},
|
|
|
"spider_name": "risk_court_notice",
|
|
|
"spider_ip": "192.168.56.1"
|
|
|
}
|
|
|
|
|
|
```
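
As a rough illustration of how a downstream consumer might read this result, the sketch below pulls the visible text out of each `court_items[].data` HTML fragment using only the Python standard library. The class `TextCollector` and the function `notice_texts` are hypothetical names, not part of the project, and the sample is a shortened version of the result above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def notice_texts(result):
    """Return source, refer and plain text for every item in court_items."""
    notices = []
    for item in result.get("data", {}).get("court_items", []):
        collector = TextCollector()
        collector.feed(item["data"])
        collector.close()
        notices.append({
            "source": item["source"],
            "refer": item["refer"],
            "text": " ".join(collector.chunks),
        })
    return notices


# Shortened sample based on the result shown above.
sample = {
    "data": {
        "court_items": [{
            "source": "http://hdzy.hbsfgk.org",
            "data": "<div class=\"ywzw_con_inner\">\r\n\t<h3 class=\"h3_title\">魏县人民法院</h3>\r\n\t<p class=\"p_notice\">公 告</p>\r\n</div>",
            "refer": "http://hdzy.hbsfgk.org/ktggInfo.jspx?fyid=153&bh=1D53B4BA51209B45055BB3DA0BE58188&isapp=null",
            "task_type": "开庭公告",
        }]
    },
    "task_result": 1000,
    "data_type": "detail",
}

notices = notice_texts(sample)
print(notices[0]["text"])
```

Because the HTML layout differs between source sites, a real parser would likely need per-site rules; this sketch only shows the common first step of stripping markup.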
|
... | ... | @@ -193,7 +226,7 @@ The only data_type is detail |
|
|
|
|
|
<!-- udm module? scrapy? Or something else? -->
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
scrapy, with preliminary parsing by udm
|
|
|
```
|
|
|
|
... | ... | @@ -201,19 +234,19 @@ scrapy, with preliminary parsing by udm |
|
|
|
|
|
<!-- Which machines is it deployed on? How many processes per machine? What is the project name? -->
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
Machine: node_51
|
|
|
Project name: judicature_spider
|
|
|
```
|
|
|
|
|
|
## Taskhub address
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
```
|
|
|
|
|
|
## Taskhub scheduling rules
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
task_result=1000 # detail task fetched successfully
|
|
|
task_result=1101 # no result: the official site has no record of this data; the result still needs to be parsed and stored
|
|
|
|
... | ... | @@ -226,7 +259,7 @@ task_result=8000 # parameter error |
|
|
|
|
|
<!-- Which metrics indicate that the crawler is running normally? What are the alerting rules? -->
|
|
|
|
|
|
```buildoutcfg
|
|
|
```
|
|
|
(Observe first; to be filled in)
|
|
|
Index:
|
|
|
Monitoring frequency:
|
... | ... | @@ -255,7 +288,7 @@ task_result=8000 # parameter error |
|
|
|
|
|
## Crawler result directory
|
|
|
|
|
|
```html
|
|
|
```
|
|
|
/data/judicature_spiders/risk_court_notice/BDP-C-1001_10.8.6.51
|
|
|
```
|
|
|
|
... | ... | @@ -266,12 +299,12 @@ task_result=8000 # parameter error |
|
|
|
|
|
## logstash configuration file name
|
|
|
|
|
|
```html
|
|
|
```
|
|
|
```
|
|
|
|
|
|
## logstash file collection type
|
|
|
|
|
|
```html
|
|
|
```
|
|
|
```
|
|
|
|
|
|
## Topic for data collection
|
... | ... | @@ -282,7 +315,7 @@ topic_id => "general-taxpayer" |
|
|
|
|
|
## ES log index and filter conditions
|
|
|
|
|
|
```html
|
|
|
```
|
|
|
|
|
|
```
|
|
|
|
... | ... | |