Last edited by 章一锋 Aug 26, 2021

taobao_find_new_shop

Basic information

Data name (Chinese)

店铺找新 (new shop discovery)

Data name (English)

taobao_find_new_shop

Source website (collection entry point)

Web page URL: https://ai.m.taobao.com/search.html?spm=a311n.14543358.8001.4&q=%E5%A4%96%E5%A5%97&keyword=%E5%A4%96%E5%A5%97&pid=mm_33231688_7050284_23466709&union_lens=recoveryid%3Aa311n.9159044_1628474552809_2395600509837521_gkKSGSYFaT%3Bprepvid%3A201_11.29.177.232_5909580_1628480622548
Collection API: https://h5api.m.taobao.com/h5/mtop.alimama.union.xt.biz.quan.api.entry/1.0/
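
The entry URL carries the search keyword percent-encoded in its q and keyword parameters. The following is a minimal sketch of how to decode it, using only the Python standard library. The URL is shortened to its search-related parameters, and build_search_url is a hypothetical helper that assumes q and keyword always hold the same value, as they do in the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Entry URL from this page, shortened to the search-related parameters
# (the pid and union_lens tracking parameters are left out here).
ENTRY_URL = (
    "https://ai.m.taobao.com/search.html"
    "?spm=a311n.14543358.8001.4"
    "&q=%E5%A4%96%E5%A5%97&keyword=%E5%A4%96%E5%A5%97"
)


def extract_keyword(url: str) -> str:
    """Return the decoded value of the `q` query parameter."""
    query = parse_qs(urlsplit(url).query)
    return query["q"][0]


def build_search_url(keyword: str) -> str:
    """Hypothetical helper: build an entry URL for another keyword.

    Assumption: `q` and `keyword` always carry the same value.
    """
    base = "https://ai.m.taobao.com/search.html"
    return base + "?" + urlencode({"q": keyword, "keyword": keyword})


if __name__ == "__main__":
    print(extract_keyword(ENTRY_URL))
```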

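Collection frequency and strategy

Full refresh strategy (existing data)

Runs daily; the duration of one full round has not yet been measured

Incremental collection strategy

None yet

Crawler

Owner

章一锋

Crawler name

taobao_remai_search_goods

Code location

http://tech.pingansec.com/granite/project-ec/-/blob/develop_udms_20210113/app_taobao/udms/taobao_find_new_shop/taobao_remai_search_goods.py

Queue name and queue address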

* redis host: redis://:utn@0818@bdp-mq-002.redis.rds.aliyuncs.com:6379/0
* redis port: 6379
* redis db:   0
* redis key:  taobao_find_new_shop:10

Priority queue notes

Task source

250,000 product search keywords

Task input parameters (sample)

Task sample

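{
    "search_key": "匡威粉色",
    "limit_page_count": 30
}

Task parameter description

search_key:       search keyword
limit_page_count: maximum number of pages the crawler fetches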
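
As an illustration of the task format, here is a minimal sketch that turns a search keyword into a task payload with the two fields shown in the sample. The function name build_task and its validation rules are assumptions made for this example. How tasks are actually pushed to the Redis queue is not described on this page, so that step is left out.

```python
import json

# Value used in the sample task on this page.
DEFAULT_LIMIT_PAGE_COUNT = 30


def build_task(search_key: str,
               limit_page_count: int = DEFAULT_LIMIT_PAGE_COUNT) -> str:
    """Serialize one crawl task as a JSON string (hypothetical helper)."""
    search_key = search_key.strip()
    if not search_key:
        raise ValueError("search_key must not be empty")
    if limit_page_count < 1:
        raise ValueError("limit_page_count must be at least 1")
    task = {"search_key": search_key, "limit_page_count": limit_page_count}
    # ensure_ascii=False keeps Chinese keywords readable in the payload.
    return json.dumps(task, ensure_ascii=False)


if __name__ == "__main__":
    print(build_task("匡威粉色"))
```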

data_type description

detail: detail record
log:    log record

Crawler result superset data

{
	"metadata": {
		"seller_id": "2208075902421",
		"coupon_share_url": null,
		"goods_id": "650692637050",
		"platform_name": "淘宝",
		"new_shop": true
	},
	"data": {
		"floorId": 31512,
		"itemId": "650692637050",
		"itemName": "正品南怀瑾肚脐贴艾灸贴祛湿祛宫寒调理暖宫贴艾草贴瘦身健身器材",
		"subTitle": "",
		"userType": 0,
		"isTmall": false,
		"sellerId": 2208075902421,
		"reservePrice": "34.98",
		"price": "34.98",
		"promotionPrice": "20.99",
		"priceAfterCoupon": "20.99",
		"originalTkSpRates": "",
		"zkSalesPrice": "0.00",
		"pic": "//img.alicdn.com/bao/uploaded/i3/2208075902421/O1CN01oSKtU51TkrZJp7Cjl_!!2208075902421.jpg",
		"smallImage": ["//img.alicdn.com/i1/2208075902421/O1CN01CmrNul1TkrZIeak3X_!!2208075902421.jpg", "//img.alicdn.com/i4/2208075902421/O1CN01xyL14q1TkrZENz1W2_!!2208075902421.jpg", "//img.alicdn.com/i3/2208075902421/O1CN01RnUyoF1TkrZENygj4_!!2208075902421.jpg", "//img.alicdn.com/i1/2208075902421/O1CN01gZxAYw1TkrZMydwjQ_!!2208075902421.jpg"],
		"couponAmount": null,
		"couponSendCount": 0,
		"couponTotalCount": 0,
		"couponTag": "",
		"couponEffectiveStartTime": "",
		"couponEffectiveEndTime": "",
		"couponStartFee": null,
		"monthSellCount": "0",
		"provcity": "浙江 杭州",
		"nick": "梦琪家",
		"realPostFee": "0.00",
		"auctionTags": "385 907 1035 1163 1483 1995 2059 2123 4491 4939 5323 6603 10571 11083 11339 11467 17739 21442 22155 25282 27137 52290 67521 85249 104514 143746 235713 241985 249858 362178 368066 1362178 1797506",
		"ostime": null,
		"oetime": null,
		"jddPrice": null,
		"jddNum": null,
		"uvSumPreSale": "",
		"couponShareUrl": null,
		"url": "//s.click.taobao.com/t?e=m%3D2%26s%3Da0%2F%2BsK6S%2FPhw4vFB6t2Z2ueEDrYVVa64qYbrUZilZ4UKwPl3T8wu7Plz6J5XeLYAg2PPeAJmYeE%2FmLO%2F5foB9eoryUtqIh4%2B4jMnl1H7sduZ4Y8JljmSntgXBYXsIl%2F%2Fl8GOlPX6XToTmB4bcjQoTQCA9QH%2B3%2BdLcSpj5qSCmbA%3D&scm=null&pvid=100_11.250.13.76_95758_9741629946387638189&app_pvid=59590_33.5.221.211_679_1629946387632&ptl=floorId:31512;originalFloorId:31512;pvid:100_11.250.13.76_95758_9741629946387638189;app_pvid:59590_33.5.221.211_679_1629946387632&union_lens=lensId%3AOPT%401629946387%402105ddd3_5e0b_17b80601cec_9eca%4001",
		"udf_temp_store": {},
		"lensId": "OPT@1629946387@2105ddd3_5e0b_17b80601cec_9eca@01"
	},
	"data_type": "detail",
	"spider_name": "taobao_remai_search_goods",
	"task_params": {
		"search_key": "肚脐贴 南怀瑾 宫寒",
		"limit_page_count": 30
	},
	"http_code": 200,
	"task_result": 1001,
	"error_msg": "",
	"platform_name": "淘宝热卖",
	"spider_start_time": "2021-08-26 10:52:08.626",
	"spider_end_time": "2021-08-26 10:53:08.092",
	"spider_used_time_ms": 59466,
	"spider_ip": "10.8.6.63"
}
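
To show how a consumer might read the record above, here is a minimal sketch that extracts the metadata fields and checks spider_used_time_ms against the start and end timestamps. The helper names are hypothetical, and the sample dictionary is a trimmed copy of the record above containing only the fields the sketch uses.

```python
from datetime import datetime

# Timestamp layout used by spider_start_time and spider_end_time.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_ms(record: dict) -> int:
    """Milliseconds between spider_start_time and spider_end_time."""
    start = datetime.strptime(record["spider_start_time"], TIME_FORMAT)
    end = datetime.strptime(record["spider_end_time"], TIME_FORMAT)
    return round((end - start).total_seconds() * 1000)


def is_new_shop_detail(record: dict) -> bool:
    """True for a detail record whose metadata flags a new shop."""
    return (record.get("data_type") == "detail"
            and record.get("metadata", {}).get("new_shop") is True)


def summarize(record: dict) -> dict:
    """Pick the identifying fields out of a result record."""
    meta = record["metadata"]
    return {
        "goods_id": meta["goods_id"],
        "seller_id": meta["seller_id"],
        "new_shop": meta["new_shop"],
        "search_key": record["task_params"]["search_key"],
    }


# Trimmed copy of the sample record on this page.
sample = {
    "metadata": {"seller_id": "2208075902421",
                 "goods_id": "650692637050",
                 "new_shop": True},
    "data_type": "detail",
    "task_params": {"search_key": "肚脐贴 南怀瑾 宫寒",
                    "limit_page_count": 30},
    "spider_start_time": "2021-08-26 10:52:08.626",
    "spider_end_time": "2021-08-26 10:53:08.092",
    "spider_used_time_ms": 59466,
}

if __name__ == "__main__":
    print(summarize(sample))
    print(elapsed_ms(sample) == sample["spider_used_time_ms"])
```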

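Data structure of the actual crawler result

Crawler runtime environment

udm

Crawler deployment information

Crawler host: 10.8.6.63
Number of processes: 50
Project name: ec

Taskhub address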

http://tech.pingansec.com/granite/project-taskhub/-/blob/master/taskhub/config/ec/config.d/taobao.yaml

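Taskhub scheduling rules

A task is filtered out when its task_result is one of the following values:
    - 1000
    - 1101
    - 1102
    - 2001
    - 7000
    - 9300
Tasks with any other task_result value are put into the queue

Crawler monitoring metrics design

Directory of crawler results awaiting collection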
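
The filtering rule above can be sketched as a simple set lookup. This is an illustration only: the real rule lives in the Taskhub YAML configuration linked above, the function name should_enqueue is hypothetical, and the meaning of the individual codes is not documented on this page.

```python
# task_result values that Taskhub filters out, as listed on this page.
FILTERED_TASK_RESULTS = frozenset({1000, 1101, 1102, 2001, 7000, 9300})


def should_enqueue(task_result: int) -> bool:
    """Return True when a task with this task_result goes into the queue."""
    return task_result not in FILTERED_TASK_RESULTS


if __name__ == "__main__":
    for code in (1000, 1001, 9300):
        print(code, should_enqueue(code))
```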


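Data aggregation

Owner

范召贤

Data aggregation method

  • Crawler writes directly to Kafka

  • [] Crawler writes files that Logstash collects

Storage directory after aggregation

/data2/ec_spider_result/taobao_remai_search_goods

Logstash configuration file name

Logstash file collection type

Aggregation topic

ec-spider-taobao-data

ES log index and filter conditions

ec-spider-data-*

Monitoring dashboard

Data retention policy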


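Data cleaning

Owner

Code location

Deployment address

Deployment method and notes

  • crontab + data_pump
  • supervisor + data_pump
  • supervisor + consumer

Data input source

Data storage table address

  • Database address:
  • Table name: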