How Would You Deduplicate 1 Billion QQ Numbers?
Preface

I recently came across a question online: how would you deduplicate 1 billion QQ numbers?
I found it quite interesting, so in this article I'd like to walk through the common solutions. I hope you find it helpful.
More hands-on project walkthroughs are on my site: http://www.susan.net.cn/project
1. Technical Challenges
1.1 Data Scale Analysis
- Raw data: 1 billion QQ numbers × 8 bytes = 8 GB
- HashSet dedup: at least 16 GB of memory once Java object overhead is counted
- Ideal solution: a bitmap, which needs only about 1.2 GB (the calculation is in Section 5.3)

5. Distributed Solutions
5.1 Sharding Strategy Design
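The one hard requirement of any sharding scheme here is that identical QQ numbers always land on the same shard, so each shard can deduplicate independently. A minimal sketch of such a routing function (the method name and shard count are illustrative):

int shard(long qq, int numShards) {
    // hash partitioning: equal numbers hash equally, so duplicates meet on one shard
    return Math.floorMod(Long.hashCode(qq), numShards);
}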
5.2 Spark Implementation
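A minimal sketch of the Spark route, using the Java API's shuffle-based distinct(); the HDFS paths and the partition count are assumptions for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class QQDedupJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("qq-dedup");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("hdfs:///data/qq/input")        // one QQ number per line
              .distinct(200)                            // shuffle-based dedup across 200 partitions
              .saveAsTextFile("hdfs:///data/qq/deduped");
        }
    }
}

For comparison, the single-machine bitmap mentioned in Section 1.1 can be implemented like this: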
public class BitMap {
    private final byte[] bits;

    public BitMap(int maxNum) {
        // each byte stores 8 numbers; +1 covers the remainder
        this.bits = new byte[(maxNum >> 3) + 1];
    }

    public void add(int num) {
        int arrayIndex = num >> 3;      // num / 8
        int position = num & 0x07;      // num % 8
        bits[arrayIndex] |= 1 << position;
    }

    public boolean contains(int num) {
        int arrayIndex = num >> 3;
        int position = num & 0x07;
        return (bits[arrayIndex] & (1 << position)) != 0;
    }
}

Note that an int-indexed BitMap only covers numbers up to Integer.MAX_VALUE (about 2.1 billion); the snippet in Section 6.3 extends the same idea to long.

5.3 Memory Optimization: RoaringBitmap
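RoaringBitmap keeps the same set semantics but compresses runs and sparse chunks. A hedged sketch using the open-source org.roaringbitmap library; since QQ numbers can exceed Integer.MAX_VALUE, the 64-bit variant is used:

import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class RoaringDemo {
    public static void main(String[] args) {
        Roaring64NavigableMap seen = new Roaring64NavigableMap();
        long qq = 9_999_999_999L;
        if (!seen.contains(qq)) {
            seen.addLong(qq);                          // first occurrence: keep it
        }
        System.out.println(seen.getLongCardinality()); // count of distinct numbers
    }
}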
Storage comparison: a flat bitmap over the full QQ range needs

(10^10 - 10^4) bits / 8 / 1024 / 1024 ≈ 1192 MB ≈ 1.16 GB

whereas RoaringBitmap compresses the same set much further when the population is sparse or clustered.

6. Production-Grade Architecture: Lambda Architecture
6.1 System Architecture Diagram
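In outline, based on the layer table in 6.2, the data flows like this:

            +--> batch layer (Spark + HDFS, full-data dedup) ------+
  raw QQ ---|                                                      +--> serving layer (Spring Boot + HBase)
            +--> speed layer (Flink + Redis, incremental dedup) ---+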
6.2 Technology Choices per Layer
Layer | Tech stack | Goal
Batch layer | Spark + HDFS | full-data deduplication
Speed layer | Flink + Redis | real-time incremental deduplication
Serving layer | Spring Boot + HBase | unified query interface

6.3 Real-Time Dedup Implementation
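On the speed layer, a Redis bitmap gives an O(1) first-seen check. A hedged sketch with the Jedis client (host, key naming, and sharding scheme are assumptions); note that one Redis string caps out at 2^32 bits, so the key space must be sharded to cover the full QQ range:

import redis.clients.jedis.Jedis;

public class RealtimeDedup {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long offset = 9_999_999_999L - 10_000;        // same offset trick as the snippet below
            String key = "qq:bitmap:" + (offset >>> 32);  // shard by high bits to stay under 2^32 bits per key
            boolean seenBefore = jedis.setbit(key, offset & 0xFFFFFFFFL, true);
            System.out.println(seenBefore ? "duplicate" : "first time");
        }
    }
}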
// Offset optimization: store (qq - 10000), since QQ numbers start at 10000;
// this extends the int-indexed BitMap from Section 5.2 to long values
public void add(long qq) {
    long num = qq - 10000;
    int arrayIndex = (int) (num >> 3);  // still fits in an int even for 10^10 numbers
    int position = (int) (num & 7);
    bits[arrayIndex] |= 1 << position;
}

7. The Ultimate Solution: a Layered Bitmap Index
7.1 Architecture Design
7.2 Storage Calculation
QQ number range: 10000 – 9999999999 (about 10 billion possible values)
Layered design:
- Layer 1: 100 shards, each covering an interval of 100 million numbers
- Layer 2: each shard split into 100 sub-intervals of 1 million numbers
- Layer 3: one RoaringBitmap per populated partition (about 1.2 MB each)

Total memory requirement: at worst 100 × 100 partitions × 1.2 MB ≈ 12 GB, but because partitions are only materialized when touched, sparse data needs far less; a sketch of the index follows.
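A minimal sketch of that layered index with lazily created per-partition bitmaps (class and field names are illustrative; the partition math follows the layer-2 granularity above):

import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

public class TieredBitmapIndex {
    private static final long PARTITION_SIZE = 1_000_000L;   // layer-2 granularity
    private final Map<Long, RoaringBitmap> partitions = new HashMap<>();

    /** Returns true if qq had not been seen before. */
    public boolean add(long qq) {
        long num = qq - 10_000;                              // offset trick from Section 6.3
        long partitionId = num / PARTITION_SIZE;             // which 1M-wide partition
        int local = (int) (num % PARTITION_SIZE);            // position inside that partition
        RoaringBitmap bm = partitions.computeIfAbsent(partitionId, k -> new RoaringBitmap());
        boolean isNew = !bm.contains(local);
        bm.add(local);
        return isNew;
    }
}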
When a small, tunable error rate is acceptable, a Bloom filter can serve as a memory-cheap front line: by the standard sizing formula m = -n·ln(p)/(ln 2)^2, roughly 1.2 GB of bit array handles 10^9 entries at a 1% false-positive rate, with k ≈ 7 hash functions.
public class BloomFilter {
private final BitSet bitset;
private final int size;
private final int[] seeds;
public BloomFilter(int size, int hashCount) {
this.bitset = new BitSet(size);
this.size = size;
this.seeds = new int[hashCount];
for (int i = 0; i < hashCount; i++) {
seeds[i] = (i + 1) * 31; // distinct non-zero seeds, so no hash function degenerates
}
}
public void add(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
bitset.set(Math.abs(hash % size), true);
}
}
public boolean contains(String qq) {
for (int seed : seeds) {
int hash = hash(qq, seed);
if (!bitset.get(Math.abs(hash % size))) {
return false;
}
}
return true;
}
private int hash(String value, int seed) {
// simple seeded polynomial hash; swap in a real MurmurHash for better distribution
int result = 0;
for (char c : value.toCharArray()) {
result = seed * result + c;
}
return result;
}
}

7.3 Java Implementation
// External sort: split the input into sorted chunk files on disk, then merge them
public void externalSort(String input, String output) throws IOException {
List<File> chunks = splitAndSort(input, 10_000_000); // 10 million lines per chunk
mergeFiles(chunks, output);
}
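splitAndSort is only referenced above, so here is a minimal sketch of it (temp-file naming is illustrative, and it assumes the same java.io/java.util imports as the surrounding code); each chunk is sorted with the same string order the merge below relies on:

List<File> splitAndSort(String input, int chunkSize) throws IOException {
List<File> chunks = new ArrayList<>();
List<String> buffer = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new FileReader(input))) {
String line;
while ((line = reader.readLine()) != null) {
buffer.add(line);
if (buffer.size() == chunkSize) { // chunk is full: sort it and flush to disk
chunks.add(writeSortedChunk(buffer));
buffer.clear();
}
}
}
if (!buffer.isEmpty()) {
chunks.add(writeSortedChunk(buffer)); // flush the final partial chunk
}
return chunks;
}

File writeSortedChunk(List<String> buffer) throws IOException {
Collections.sort(buffer); // lexicographic, matching MergeEntry.compareTo
File chunk = File.createTempFile("qq-chunk", ".txt");
try (BufferedWriter writer = new BufferedWriter(new FileWriter(chunk))) {
for (String s : buffer) {
writer.write(s);
writer.newLine();
}
}
return chunk;
}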
// k-way merge with a min-heap over the chunk heads
void mergeFiles(List<File> files, String output) throws IOException {
PriorityQueue<MergeEntry> queue = new PriorityQueue<>();
List<BufferedReader> readers = new ArrayList<>();
// seed the heap with the first line of every chunk
for (File file : files) {
BufferedReader reader = new BufferedReader(new FileReader(file));
readers.add(reader);
String line = reader.readLine();
if (line != null) {
queue.add(new MergeEntry(line, reader));
}
}
try (BufferedWriter writer = new BufferedWriter(new FileWriter(output))) {
long last = -1;
while (!queue.isEmpty()) {
MergeEntry entry = queue.poll();
long qq = Long.parseLong(entry.value);
// dedup: write each QQ number only once
if (qq != last) {
writer.write(entry.value);
writer.newLine();
last = qq;
}
// advance that chunk's reader
String next = entry.reader.readLine();
if (next != null) {
queue.add(new MergeEntry(next, entry.reader));
}
}
} finally {
readers.forEach(r -> { try { r.close(); } catch (IOException e) {}});
}
}
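With k chunk files the heap keeps each merge step at O(log k), so the whole merge pass is O(n log k); together with the O(n log n) in-memory sort per chunk, this is the classic external-sort bound, trading RAM for sequential disk I/O.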
class MergeEntry implements Comparable<MergeEntry> {
String value;
BufferedReader reader;
public MergeEntry(String value, BufferedReader reader) {
this.value = value;
this.reader = reader;
}
@Override
public int compareTo(MergeEntry o) {
// lexicographic, not numeric: equal values still end up adjacent in the
// merged stream, which is all the dedup pass needs
return this.value.compareTo(o.value);
}
}

8. Solution Comparison and Selection Advice
Solution | Best for | Memory/storage | Time complexity | Accuracy
Single-machine bitmap | int-range IDs, enough RAM | ~1.2 GB | O(n) | exact
RoaringBitmap | sparse or clustered IDs | well under 1.2 GB | O(n) | exact
Bloom filter | tight memory, tolerable error | ~1.2 GB at p = 1% | O(n) | approximate
External sort | limited RAM, disk available | disk-bound | O(n log n) | exact
Distributed (Spark / Lambda) | cluster-scale data | scales horizontally | O(n) per node | exact