cascading-simhash: a library to cluster by minhashes in Hadoop
Published: 2019-06-29


simhashing

Say you have a large corpus of web documents and you want to group them together by some notion of “similarity”. For instance, you may want to detect plagiarism or find content that appears on multiple pages of a site.

In this scenario, it’s impractical to do a pairwise comparison of all documents. Fortunately, we can use simhashing.

Broadly speaking, simhashing is an algorithm that calculates a “cluster id” (the minimum hash, or minhash) from the content of each document. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.
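
To make the idea concrete, here is a minimal Python sketch of computing a minhash per document and grouping by it. It is not the cascading-simhash API: the function names (`shingles`, `stable_hash`, `minhash`, `cluster`), the choice of two-word shingles, and the use of MD5 as the hash are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, size=2):
    """Return the set of overlapping word n-grams in text (size is an assumed default)."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def stable_hash(token):
    """Hash a token to an integer that is identical across processes and machines."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, size=2):
    """Cluster id for one document: the smallest hash over its shingles.

    Only the document itself is needed, so this can run in a map step.
    """
    return min(stable_hash(s) for s in shingles(text, size))


def cluster(docs, size=2):
    """Group document names by minhash, as a reduce step keyed on the cluster id would."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[minhash(text, size)].append(name)
    return dict(groups)
```

Calling `cluster({"a": "the quick brown fox", "b": "The quick  brown fox"})` puts both names under the same key, since the two texts produce the same shingle set. Documents that merely overlap share a minhash only some of the time, so a single minhash gives a coarse grouping; practical setups usually combine several.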

Reposted from: http://dhvkm.baihongyu.com/
