BigData

Problem:

Given a 100 GB file of URLs and only 1 GB of memory, find the TOP 100 URLs (the 100 URLs that occur most often).

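Algorithm:

1. The 100 GB file is far larger than the 1 GB of memory, so split it into many small files. Route each URL by its hash so that every occurrence of the same URL lands in the same small file.

2. For each small file, count and sort its URLs and keep that file's top 100 (using a TreeMap).

3. Finally, merge the per-file results to get the overall TOP 100 URLs.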
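The three steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the class name `TopUrls`, the method names, the `part-i.txt` file naming, and the use of `String.hashCode()` as the partitioning hash are all assumptions. It also assumes one URL per line and that `m` is chosen large enough for the distinct URLs of one small file to fit in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopUrls {
    /** Step 1: route each URL to one of m small files by hash, so equal URLs share a file. */
    static List<Path> partition(Path input, Path dir, int m) throws IOException {
        List<Path> parts = new ArrayList<>();
        BufferedWriter[] out = new BufferedWriter[m];
        for (int i = 0; i < m; i++) {
            Path p = dir.resolve("part-" + i + ".txt"); // hypothetical naming scheme
            parts.add(p);
            out[i] = Files.newBufferedWriter(p, StandardCharsets.UTF_8);
        }
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String url;
            while ((url = in.readLine()) != null) {
                BufferedWriter w = out[Math.floorMod(url.hashCode(), m)];
                w.write(url);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : out) w.close();
        }
        return parts;
    }

    /** Step 2: count the URLs of one small file and keep its k most frequent ones. */
    static Map<String, Long> topKOfFile(Path part, int k) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
            String url;
            while ((url = in.readLine()) != null) counts.merge(url, 1L, Long::sum);
        }
        return topK(counts, k);
    }

    /** Sorts by count (descending) with a TreeMap and returns the first k entries. */
    static Map<String, Long> topK(Map<String, Long> counts, int k) {
        TreeMap<Long, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        counts.forEach((url, c) -> byCount.computeIfAbsent(c, x -> new ArrayList<>()).add(url));
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<Long, List<String>> e : byCount.entrySet()) {
            for (String url : e.getValue()) {
                if (top.size() == k) return top;
                top.put(url, e.getKey());
            }
        }
        return top;
    }

    /** Step 3: merge the per-file winners and pick the overall top k. */
    static Map<String, Long> topUrls(Path input, Path workDir, int m, int k) throws IOException {
        Map<String, Long> candidates = new HashMap<>();
        for (Path part : partition(input, workDir, m)) {
            // Each URL lives in exactly one small file, so its count there is already final.
            candidates.putAll(topKOfFile(part, k));
        }
        return topK(candidates, k);
    }
}
```

Because a URL can only appear in one small file, the merge step never has to add counts together; it only compares at most `m * k` candidates. URLs with equal counts come back in no particular order.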

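Edge case:

When many different URLs share the same hash value, they all land in the same small file, and that file can exceed the 1 GB memory limit.

Solution:

Borrow the technique used to resolve hash collisions: compute a new hash value with a rehash function, Hi = (H(key) + di) % m, where H is the original hash function, di is the increment used for the i-th rehash, and m is the number of small files. This spreads the URLs evenly across the files.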
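One way to apply the formula Hi = (H(key) + di) % m is sketched below. The source does not say how di is chosen, so this sketch assumes double hashing, di = i * H2(key), with FNV-1a as the second hash H2. The class name `Rehash` and both method names are hypothetical. Deriving di from the URL itself keeps the mapping deterministic, so every copy of a URL still ends up in the same file.

```java
import java.nio.charset.StandardCharsets;

class Rehash {
    /**
     * File index for rehash round i: H_i = (H(key) + d_i) % m.
     * Assumption: d_i = i * step, where step comes from a second hash of the URL.
     * Round 0 is the plain H(key) % m. Requires m >= 2 and round >= 0.
     */
    static int bucket(String url, int round, int m) {
        long h = Math.floorMod(url.hashCode(), m);
        long step = 1 + Math.floorMod(secondHash(url), m - 1); // in 1..m-1, never 0
        return (int) ((h + (long) round * step) % m);
    }

    /** Second hash H2: 32-bit FNV-1a over the UTF-8 bytes (an assumed choice). */
    static int secondHash(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }
}
```

A possible use: if a small file produced in round 0 turns out to be too large, split that file again with `bucket(url, 1, m)`. URLs that collided under H(key) get different increments from H2 and therefore spread out over the new files.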

Tests:

Input sets:

1. Most URLs are evenly distributed.

2. Most URLs have the same hash value.
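The two input sets could be generated along these lines. This is only a sketch: the class `TestInputs`, the `example.com` URLs, and the assumption that the partitioning hash is Java's `String.hashCode()` are not taken from the source. The colliding set relies on the fact that "Aa" and "BB" have the same `hashCode()`, so any strings built from the same number of these two-character blocks collide as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class TestInputs {
    /** Input set 1: URLs with random paths, so their hashes spread roughly evenly. */
    static List<String> uniform(int n, long seed) {
        Random rnd = new Random(seed);
        List<String> urls = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            urls.add("http://example.com/" + Long.toHexString(rnd.nextLong()));
        }
        return urls;
    }

    /** Input set 2: n distinct URLs that all share one String.hashCode() value. */
    static List<String> colliding(int n) {
        // Number of "Aa"/"BB" blocks needed so that 2^bits >= n.
        int bits = 32 - Integer.numberOfLeadingZeros(Math.max(1, n - 1));
        List<String> urls = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder("http://example.com/");
            for (int b = 0; b < bits; b++) {
                sb.append(((i >> b) & 1) == 0 ? "Aa" : "BB");
            }
            urls.add(sb.toString());
        }
        return urls;
    }
}
```

To match the "most URLs" wording of the second input set, the two lists can be mixed, for example a large `colliding` list plus a small `uniform` one.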

Contributors

fan1122
