    • 1. Invention application
    • METHOD AND APPARATUS FOR RETRIEVING AND INDEXING HIDDEN WEB PAGES
    • WO2006007229A1
    • 2006-01-19
    • PCT/US2005/018849
    • 2005-05-27
    • THE REGENTS OF THE UNIVERSITY OF CALIFORNIA; NTOULAS, Alexandros; CHO, Junghoo; ZERFOS, Petros
    • NTOULAS, Alexandros; CHO, Junghoo; ZERFOS, Petros
    • G06F15/16
    • G06F17/30864
    • A method and system are provided for autonomously downloading and indexing Hidden Web pages from Websites having site-specific search interfaces. The method may be implemented using a crawler program or the like to autonomously cull Hidden Web content. The method includes the steps of selecting a query term (110) and issuing a query to a site-specific (120) search interface containing Hidden Web pages. A results index is then acquired and the Hidden Web pages are downloaded from the results index. A plurality of potential query terms is then identified from the downloaded Hidden Web pages (130). The efficiency of each potential query term is then estimated, and the potential query term with the greatest efficiency is selected as the next query term. The next query term is then issued to the site-specific search interface. The process is repeated until all or most of the Hidden Web pages are discovered. In one aspect of the invention, the efficiency of each potential query term is expressed as the ratio of the number of new documents returned for the potential query term to the cost associated with issuing the potential query.
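
The loop described in the abstract (issue a query, download the results, collect candidate terms from the downloaded pages, pick the candidate with the highest efficiency, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory `site` dictionary, the helper names (`search`, `estimate_matches`, `query_cost`, `efficiency`, `crawl`), and the linear cost model are all hypothetical. In particular, `estimate_matches` simply reads the toy site, whereas a real crawler would have to estimate match counts from the pages it has already downloaded.

```python
from collections import Counter


def search(site, term):
    """Hypothetical site-specific search interface: ids of pages containing term."""
    return {doc_id for doc_id, text in site.items() if term in text.split()}


def estimate_matches(site, term):
    """Assumed oracle for how many pages match term.

    A real crawler cannot peek at the site; it would estimate this number
    from statistics of the pages downloaded so far.
    """
    return len(search(site, term))


def query_cost(num_results):
    """Assumed cost model: fixed cost per query plus a cost per result."""
    return 1.0 + 0.1 * num_results


def efficiency(new_docs, cost):
    """Efficiency as in the abstract: new documents returned per unit cost."""
    return new_docs / cost


def crawl(site, seed_term, max_queries=20):
    downloaded = {}  # doc_id -> page text
    issued = []      # query terms in the order they were issued
    term = seed_term
    while term is not None and len(issued) < max_queries:
        issued.append(term)
        # Issue the query and download every page in the result index.
        for doc_id in search(site, term):
            downloaded[doc_id] = site[doc_id]
        # Candidate terms: every word seen in the downloaded pages,
        # counted by the number of downloaded pages containing it.
        doc_freq = Counter()
        for text in downloaded.values():
            doc_freq.update(set(text.split()))
        # Greedy step: pick the unissued candidate with the greatest efficiency.
        term, best = None, 0.0
        for candidate in sorted(doc_freq):
            if candidate in issued:
                continue
            total = estimate_matches(site, candidate)
            new_docs = total - doc_freq[candidate]  # matches not yet downloaded
            score = efficiency(new_docs, query_cost(total))
            if score > best:
                term, best = candidate, score
    return downloaded, issued
```

Calling `crawl(site, "apple")` on a small dictionary of page texts returns the downloaded pages and the sequence of issued terms. The loop stops once no remaining candidate is expected to return a new page, so pages that share no term with anything already downloaded stay undiscovered, which matches the "all or most" wording of the abstract.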