- 💻 Currently working in Shanghai, China
- 📫 How to reach me: [email protected]
- 📡 You may like My Website
- 🏄♂️ A stable, fast VPN
aikuyun / bb_everyday
Study notes, updated every day at no fixed time.
Simply put, demand is a function, and quantity demanded is a single point on that function.
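Problem 1: Hive enables map-side pre-aggregation by default (parameter hive.map.aggr, default true). With group by, data skew in one dimension can make the map side run out of memory.
Symptom: a large number of map tasks fail with a GC limit error.
Fixes:
Option 1: set hive.map.aggr = false;
Option 2: set hive.groupby.skewindata = true; This enables load balancing. When skewed keys exist, the query is split into two MR jobs: the first spreads the rows randomly and pre-aggregates them, and the second performs the final aggregation.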
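The two-job idea behind option 2 can be sketched in plain Java. This is a toy model and not Hive's implementation: the class `TwoStageCount`, its method names, and the fixed seed are all made up for illustration. Stage 1 routes rows to reducers at random instead of by key, so a hot key is spread out; stage 2 merges the partial counts by key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy model (hypothetical names) of two-stage aggregation for a skewed GROUP BY count. */
final class TwoStageCount {

    /** Stage 1: spread rows randomly over reducers; each reducer pre-aggregates what it receives. */
    static List<Map<String, Long>> partialCounts(List<String> keys, int reducers, long seed) {
        Random random = new Random(seed);
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partials.add(new HashMap<>());
        }
        for (String key : keys) {
            // Random routing instead of hash(key): a hot key no longer lands on one reducer.
            Map<String, Long> bucket = partials.get(random.nextInt(reducers));
            bucket.merge(key, 1L, Long::sum);
        }
        return partials;
    }

    /** Stage 2: group the partial results by key and sum them into the final counts. */
    static Map<String, Long> finalCounts(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("hot"); // the skewed key
        }
        keys.add("rare");
        System.out.println(finalCounts(partialCounts(keys, 4, 42L)));
    }
}
```

The final counts are the same as in a single-stage aggregation; only the distribution of work across reducers changes, at the cost of a second job.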
Add the pom dependencies:
<dependency>
    <groupId>com.maxmind.db</groupId>
    <artifactId>maxmind-db</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>com.maxmind.geoip2</groupId>
    <artifactId>geoip2</artifactId>
    <version>2.12.0</version>
</dependency>
Packaging plugin:
<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <!-- Optional: with this section, the build produces a runnable jar directly -->
                <!--<archive>-->
                <!--<manifest>-->
                <!--<mainClass>${exec.mainClass}</mainClass>-->
                <!--</manifest>-->
                <!--</archive>-->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
    </plugins>
</build>
Java code:
package common.udf;

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;
import com.maxmind.geoip2.record.City;
import com.maxmind.geoip2.record.Country;
import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Prerequisites:
 * create temporary function ip_analyse as 'common.udf.IP2Location' using 'hdfs:///jars/hive-custom-udf-2.1-jar-with-dependencies.jar';
 * set mapred.cache.files=/data/ip/GeoLite2-City.mmdb#GeoLite2-City.mmdb;
 *
 * Usage:
 * select ip_analyse('ip','0'); --> country
 * select ip_analyse('ip','1'); --> city
 * select ip_analyse('ip','2'); --> country:city
 *
 * @program: hive_custom_udf
 * @description: resolve an IP address to country and city information
 * @author: TSL
 * @create: 2019-07-08 10:25
 */
public class IP2Location extends UDF {

    // The DatabaseReader is thread-safe and expensive to build, so one instance is reused across lookups.
    private static DatabaseReader reader;

    private static synchronized DatabaseReader getReader() throws IOException {
        if (reader == null) {
            // GeoIP2 or GeoLite2 database file, resolved relative to the task's working directory
            File database = new File("GeoLite2-City.mmdb");
            reader = new DatabaseReader.Builder(database).build();
        }
        return reader;
    }

    /**
     * @param ip   IP address
     * @param type lookup type: 0 country, 1 city, 2 country:city
     * @return the requested zh-CN name(s), or "error" if the lookup fails or the type is unknown
     */
    public static String evaluate(String ip, String type) {
        return IPAnalyse(ip, type);
    }

    /**
     * @param ip   IP address
     * @param type lookup type: 0 country, 1 city, 2 country:city
     * @return the result string
     */
    public static String IPAnalyse(String ip, String type) {
        CityResponse response;
        try {
            // Build the address object, then look it up in the database
            InetAddress ipAddress = InetAddress.getByName(ip);
            response = getReader().city(ipAddress);
        } catch (IOException | GeoIp2Exception e) {
            // Covers a missing database file, an unresolvable address, and an address absent from the database
            e.printStackTrace();
            return "error";
        }
        // Return the requested level of detail, down to the city
        String res;
        if ("0".equals(type)) {
            Country country = response.getCountry();
            res = country.getNames().get("zh-CN");
        } else if ("1".equals(type)) {
            City city = response.getCity();
            res = city.getNames().get("zh-CN");
        } else if ("2".equals(type)) {
            Country country = response.getCountry();
            City city = response.getCity();
            res = country.getNames().get("zh-CN") + ":" + city.getNames().get("zh-CN");
        } else {
            res = "error";
        }
        return res;
    }
}
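One gap in the UDF above is that `getNames().get("zh-CN")` returns null when a record has no Chinese name, so type 2 can yield strings such as "null:null". The following is a minimal sketch of one way to handle that, assuming a fallback to the "en" name is acceptable. The class `LocationFormat` and its methods are hypothetical and not part of the original UDF or the GeoIP2 library; they only operate on the name maps.

```java
import java.util.Map;

/** Hypothetical helper: picks a localized name and formats the result by lookup type. */
final class LocationFormat {

    /** Returns the zh-CN name, else the en name, else an empty string. */
    static String localizedName(Map<String, String> names) {
        if (names == null) {
            return "";
        }
        String name = names.get("zh-CN");
        if (name == null) {
            name = names.get("en"); // assumed fallback locale for this sketch
        }
        return name == null ? "" : name;
    }

    /** Same type codes as the UDF: 0 country, 1 city, 2 country:city, anything else "error". */
    static String format(Map<String, String> countryNames, Map<String, String> cityNames, String type) {
        String country = localizedName(countryNames);
        String city = localizedName(cityNames);
        switch (type == null ? "" : type) {
            case "0":
                return country;
            case "1":
                return city;
            case "2":
                return country + ":" + city;
            default:
                return "error";
        }
    }
}
```

Inside the UDF this could be called as `LocationFormat.format(response.getCountry().getNames(), response.getCity().getNames(), type)`, since both `getNames()` calls return a `Map<String, String>`.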
Run the following command to upload the jar:
hdfs dfs -put xxxx.jar /jars/
Run the following commands to upload the GeoLite2 database, then register and call the function in Hive:
hdfs dfs -put GeoLite2-City.mmdb /data
create temporary function ip_analyse as 'common.udf.IP2Location' using 'hdfs:///jars/xxx.jar';
set mapred.cache.files=/data/GeoLite2-City.mmdb#GeoLite2-City.mmdb;
select ip_analyse('12.234.67.172','0'); -- country
select ip_analyse('12.234.67.172','1'); -- city
select ip_analyse('12.234.67.172','2'); -- country:city
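DistributedCache is a file-caching facility provided by Hadoop. It automatically distributes the specified files to every node and caches them locally, where user programs can read them.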
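In a cache entry such as /data/GeoLite2-City.mmdb#GeoLite2-City.mmdb, the part before # is the HDFS path and the part after it names the link created in the task's working directory, which is why the UDF can open the database by its bare file name. The sketch below only illustrates how such an entry splits; `CacheEntry` is a hypothetical class, not a Hadoop API, and the fallback to the base name when no fragment is given is an assumption of this sketch.

```java
import java.net.URI;

/** Hypothetical helper that splits a cache entry of the form path#linkName. */
final class CacheEntry {
    final String hdfsPath;
    final String linkName;

    CacheEntry(String entry) {
        URI uri = URI.create(entry);
        this.hdfsPath = uri.getPath();
        String fragment = uri.getFragment();
        // Assumption for this sketch: without a fragment, use the file's base name.
        this.linkName = fragment != null
                ? fragment
                : hdfsPath.substring(hdfsPath.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        CacheEntry entry = new CacheEntry("/data/GeoLite2-City.mmdb#GeoLite2-City.mmdb");
        System.out.println(entry.hdfsPath + " -> " + entry.linkName);
    }
}
```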