Hive Optimization in Practice

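1. GROUP BY data skew

Problem 1: Hive enables map-side pre-aggregation by default (parameter: hive.map.aggr, default true). When a GROUP BY runs over a dimension whose data is skewed, the map side can run out of memory.

Symptom: a large number of map tasks fail with GC limit errors (such as GC overhead limit exceeded).

Fixes:

Option 1: set hive.map.aggr = false;

Option 2: set hive.groupby.skewindata = true; This enables load balancing. When skewed keys are present, the query is split into two MR jobs: the first scatters the rows across reducers and pre-aggregates them, and the second performs the final aggregation.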
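The two-job plan of Option 2 can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Hive's implementation: the class name `SkewedGroupBySketch`, the reducer count, and the seed are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class SkewedGroupBySketch {

    /** Stage 1: spread rows over reducers at random and pre-aggregate per reducer. */
    static List<Map<String, Long>> partialCounts(List<String> keys, int reducers, long seed) {
        Random random = new Random(seed);
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partials.add(new HashMap<>());
        }
        for (String key : keys) {
            // Random assignment: a hot key no longer lands on a single reducer.
            int target = random.nextInt(reducers);
            partials.get(target).merge(key, 1L, Long::sum);
        }
        return partials;
    }

    /** Stage 2: merge the partial results by key to obtain the final counts. */
    static Map<String, Long> finalCounts(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("hot"); // the skewed key
        }
        keys.add("a");
        keys.add("b");
        System.out.println(finalCounts(partialCounts(keys, 4, 42L)));
    }
}
```

Because stage 1 ignores the key when choosing a reducer, each reducer sees roughly the same number of rows, and stage 2 only has to merge one small partial result per key per reducer.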

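Writing a UDF to resolve IP addresses: pitfalls along the way

Choosing an offline IP database

  • 埃文科技: the offline database file is rather large, so it was not chosen.
  • MaxMind: uses its own packed format, which is faster and relies on binary search; this is the one chosen.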
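To show why a binary search makes offline lookups fast, here is a minimal sketch of a range lookup over a sorted table of IPv4 ranges. It is not MaxMind's actual file format or API; the class `IpRangeLookupSketch`, the range table, and the labels are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

final class IpRangeLookupSketch {

    // Hypothetical range table, sorted by start address.
    private static final long[] STARTS = {
        ipToLong("1.0.0.0"), ipToLong("8.8.8.0"), ipToLong("12.0.0.0")
    };
    private static final long[] ENDS = {
        ipToLong("1.0.0.255"), ipToLong("8.8.8.255"), ipToLong("12.255.255.255")
    };
    private static final String[] LABELS = {"range-A", "range-B", "range-C"};

    /** Converts a dotted-quad IPv4 string to an unsigned 32-bit value held in a long. */
    static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns the label of the range containing the address, or null if none does. */
    static String lookup(String ip) {
        long address = ipToLong(ip);
        int index = Arrays.binarySearch(STARTS, address);
        if (index < 0) {
            // No exact match: binarySearch returns -(insertionPoint) - 1,
            // so the candidate range is the one just before the insertion point.
            index = -index - 2;
        }
        if (index >= 0 && address <= ENDS[index]) {
            return LABELS[index];
        }
        return null;
    }
}
```

Each lookup costs O(log n) comparisons regardless of how many ranges the table holds, which is the property that matters when a UDF is called once per row.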

Java code snippets

Add the pom dependencies:

<dependency>
    <groupId>com.maxmind.db</groupId>
    <artifactId>maxmind-db</artifactId>
    <version>1.2.2</version>
</dependency>

<dependency>
    <groupId>com.maxmind.geoip2</groupId>
    <artifactId>geoip2</artifactId>
    <version>2.12.0</version>
</dependency>

Packaging plugin

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <!-- Optional: with this section enabled, the build produces a runnable jar directly -->
                <!--<archive>-->
                <!--<manifest>-->
                <!--<mainClass>${exec.mainClass}</mainClass>-->
                <!--</manifest>-->
                <!--</archive>-->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
    </plugins>
</build>

Java code

package common.udf;

import java.io.File;
import java.io.IOException;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.record.Country;
import com.maxmind.geoip2.record.City;
import org.apache.hadoop.hive.ql.exec.UDF;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;

import java.net.InetAddress;
import java.net.UnknownHostException;

/**
 * Prerequisites:
 * create temporary function ip_analyse as 'common.udf.IP2Location' using 'hdfs:///jars/hive-custom-udf-2.1-jar-with-dependencies.jar';
 * set mapred.cache.files=/data/ip/GeoLite2-City.mmdb#GeoLite2-City.mmdb;
 *
 * Usage:
 * select ip_analyse('ip','0'); --> country
 * select ip_analyse('ip','1'); --> city
 * select ip_analyse('ip','2'); --> country:city
 */

/**
 * @program: hive_custom_udf
 * @description: Resolve an IP address to country and city information
 * @author: TSL
 * @create: 2019-07-08 10:25
 **/
public class IP2Location extends UDF {

    // Key used to pick the localized name from the GeoIP2 records.
    private static final String LOCALE = "zh-CN";

    // DatabaseReader is thread-safe and costly to build, so it is created once
    // and reused across lookups instead of being rebuilt for every row.
    private static volatile DatabaseReader reader;

    /**
     * @param ip   IP address
     * @param type lookup type: 0 country, 1 city, 2 country:city
     * @return the resolved name, "error" for an unknown type, or null if the lookup fails
     */
    public String evaluate(String ip, String type) {
        return IPAnalyse(ip, type);
    }

    /** Lazily opens the database on first use. */
    private static DatabaseReader getReader() throws IOException {
        if (reader == null) {
            synchronized (IP2Location.class) {
                if (reader == null) {
                    // GeoIP2 or GeoLite2 database file, opened by the name given
                    // after '#' in mapred.cache.files.
                    File database = new File("GeoLite2-City.mmdb");
                    reader = new DatabaseReader.Builder(database).build();
                }
            }
        }
        return reader;
    }

    /**
     * @param ip   IP address
     * @param type lookup type: 0 country, 1 city, 2 country:city
     * @return the resolved string, or null if the address cannot be resolved
     */
    public static String IPAnalyse(String ip, String type) {
        if (ip == null || type == null) {
            return null;
        }
        try {
            // Build the address object, then query the database.
            InetAddress ipAddress = InetAddress.getByName(ip);
            CityResponse response = getReader().city(ipAddress);

            Country country = response.getCountry();
            City city = response.getCity();

            // Return the requested level of detail, down to the city.
            if ("0".equals(type)) {
                return country.getNames().get(LOCALE);
            } else if ("1".equals(type)) {
                return city.getNames().get(LOCALE);
            } else if ("2".equals(type)) {
                return country.getNames().get(LOCALE) + ":" + city.getNames().get(LOCALE);
            }
            return "error";
        } catch (IOException | GeoIp2Exception e) {
            // Covers an unreadable database, an unknown host, and an address
            // missing from the database. Return null rather than failing the task.
            e.printStackTrace();
            return null;
        }
    }

}
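One pitfall worth guarding against: `InetAddress.getByName` performs a DNS lookup when its argument is not an IP literal, so dirty values in the IP column can trigger name resolution inside a task. A minimal sketch of a pre-check is shown below, assuming only IPv4 literals need to be accepted; the class `Ipv4LiteralCheck` is a hypothetical helper and is not part of the original UDF.

```java
final class Ipv4LiteralCheck {

    /** Returns true only for dotted-quad IPv4 literals such as "12.234.67.172". */
    static boolean isIpv4Literal(String text) {
        if (text == null) {
            return false;
        }
        // The -1 limit keeps trailing empty parts, so "1.2.3." is rejected.
        String[] parts = text.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty() || part.length() > 3) {
                return false;
            }
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }
}
```

A UDF could call such a check first and return null for anything that fails it. Note that this sketch also rejects IPv6 addresses, which may or may not be acceptable for a given data set.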

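How to use

  1. After packaging, upload the jar to an HDFS directory; here it is /jars.

Run:

hdfs dfs -put xxxx.jar /jars/

  2. Upload the offline database to an HDFS directory; here it is /data.

Run:

hdfs dfs -put GeoLite2-City.mmdb /data

  3. Use it in SQL:

create temporary function ip_analyse as 'common.udf.IP2Location' using 'hdfs:///jars/xxx.jar';
set mapred.cache.files=/data/GeoLite2-City.mmdb#GeoLite2-City.mmdb;

select ip_analyse('12.234.67.172','0'); -- country
select ip_analyse('12.234.67.172','1'); -- city
select ip_analyse('12.234.67.172','2'); -- country:city

DistributedCache is the file-caching facility provided by Hadoop. It automatically distributes the specified files to every node and caches them locally for user programs to read. The name after the # is the link name under which the cached file appears in the task's working directory, which is why the UDF can open GeoLite2-City.mmdb by its bare file name.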
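If the `set mapred.cache.files` line is forgotten, the database file is simply absent from the task's working directory and the failure can be hard to read. The sketch below checks for the file up front and reports a clearer message. It assumes the cached file is exposed under its link name in the working directory; the class `CachedFileLocator` and its method are hypothetical names introduced for this example.

```java
import java.io.File;

final class CachedFileLocator {

    /**
     * Returns the cached file if it exists under the given directory,
     * otherwise throws with a hint about the likely cause.
     */
    static File requireCachedFile(File workingDir, String linkName) {
        File file = new File(workingDir, linkName);
        if (!file.isFile() || !file.canRead()) {
            throw new IllegalStateException(
                "Cached file not found: " + file.getPath()
                    + " (was mapred.cache.files set with #" + linkName + "?)");
        }
        return file;
    }

    public static void main(String[] args) {
        // "." is the task's working directory when this runs inside a task.
        File database = requireCachedFile(new File("."), "GeoLite2-City.mmdb");
        System.out.println("Using database at " + database.getAbsolutePath());
    }
}
```

Run outside a task without the file present, `main` throws the `IllegalStateException`, which is the behaviour this check is meant to surface early.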
