
Basic information:
- Patent title: 一种爬虫系统IO优化方法及装置
- Patent title (English): Crawler system IO optimization method and device
- Application number: CN201711088268.5 Filing date: 2017-11-07
- Publication number: CN107943858A Publication date: 2018-04-20
- Inventors: 陈开冉, 邓楚健
- Applicant: 广州探迹科技有限公司
- Applicant address: 广东省广州市番禺区小谷围街青蓝街26号503
- Patentee: 广州探迹科技有限公司
- Current patentee: 广州探迹科技有限公司
- Current patentee address: 广东省广州市番禺区小谷围街青蓝街26号503
- Agency: 广州三环专利商标代理有限公司
- Agents: 宋静娜; 郝传鑫
- Main classification: G06F17/30
- IPC classification: G06F17/30
The invention discloses a crawler system IO optimization method and device in the field of software engineering. It aims to solve the problem that storing results one crawler task at a time yields low IO efficiency and degrades retrieval efficiency. The method comprises the following steps: a first result processor caches the crawling results it receives; when it determines that the number of cached crawling results exceeds an aggregation threshold, it writes those crawling results into an aggregation file by end-to-end concatenation and records the position offset of each crawling result; it generates, from the content of the aggregation file, an aggregation path in a large-file object storage system and sends the aggregation file to that path; and it generates, from the aggregation file, an aggregation log comprising each crawling result, the position offset of each crawling result, the aggregation path, and the number of each crawler, and sends the aggregation log to a log processor.
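The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `ResultProcessor`, `receive`, `flush`, and `read_result` are hypothetical, a plain `dict` stands in for the large-file object storage system, and a plain `list` stands in for the log processor. Deriving the aggregation path from a SHA-256 digest of the file content is one assumed way to generate a path "according to the content"; the abstract does not say which function is used. The `length` field is also an addition, included so a single result can be sliced back out, and the result payload itself is left out of each log entry for brevity.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ResultProcessor:
    """Hypothetical stand-in for the abstract's first result processor."""
    threshold: int                               # aggregation threshold
    storage: dict = field(default_factory=dict)  # stand-in for the object storage system
    log: list = field(default_factory=list)      # stand-in for the log processor
    _cache: list = field(default_factory=list)   # cached (crawler_id, payload) pairs

    def receive(self, crawler_id: str, payload: bytes) -> None:
        """Cache one crawling result; aggregate once the count exceeds the threshold."""
        self._cache.append((crawler_id, payload))
        if len(self._cache) > self.threshold:
            self.flush()

    def flush(self) -> None:
        """Concatenate cached results end to end and record each offset."""
        if not self._cache:
            return
        blob = bytearray()
        entries = []
        for crawler_id, payload in self._cache:
            # The offset is the current end of the aggregation file.
            entries.append({"crawler_id": crawler_id,
                            "offset": len(blob),
                            "length": len(payload)})  # assumed extra field
            blob += payload
        data = bytes(blob)
        # Assumption: the path is derived from a hash of the file content.
        path = "aggregated/" + hashlib.sha256(data).hexdigest()
        self.storage[path] = data  # "send" the aggregation file to its path
        for entry in entries:
            entry["path"] = path
        self.log.extend(entries)   # "send" the aggregation log
        self._cache.clear()


def read_result(storage: dict, entry: dict) -> bytes:
    """Recover one crawling result from an aggregation file via its log entry."""
    start = entry["offset"]
    return storage[entry["path"]][start:start + entry["length"]]
```

With `threshold=2`, the third call to `receive` triggers one write of a single aggregated object instead of three separate writes, which is the IO saving the abstract describes. `read_result` then uses the logged path and offset to retrieve an individual result.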
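IPC classification hierarchy:
G | Physics |
--G06 | Computing; calculating; counting |
----G06F | Electric digital data processing |
------G06F17/00 | Digital computing or data processing equipment or methods, specially adapted for specific functions |
--------G06F17/30 | .Information retrieval; database structures therefor |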