Hive高手之路:二、hive数据分析实验
数据定义、数据插入、数据查询实验跳过
目录
一.ELT原始数据处理
1.新建项目
2.添加项目依赖
3.文件内容
1)数据抽取、拆分工具类函数
2)继承Mapper
3)运行Runner
4.打包
编辑
5.上传至hdfs
1)启动环境
2)上传数据文件与jar包
3)数据上传至集群
4)运行jar包
二.准备工作
1.准备hive环境
2.创建表
1)gulivideo_ori
2)gulivideo_user_ori
3) gulivideo_orc
4)gulivideo_user_orc
3.导入ELT后的数据
4.向ORC表插入数据
5.查看导入的数据结果
三、业务分析
1.统计观看视频数Top10
编辑 2.统计视频类热度
3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
4.统计视频观看数Top50所关联视频的所属类别Rank
5.统计每个类别中的视频热度Top10,以Music为例
1)创建类别表
2)向类别表中插入数据
3)统计Music类别的Top10(也可以统计其他)
6.统计每个类别中视频流量Top10,以Music例
7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
8.统计每个类别视频观看数Top10
一.ELT原始数据处理
1.新建项目
2.添加项目依赖
3.文件内容
1)数据抽取、拆分工具类函数
2)继承Mapper
3)运行Runner
4.打包
5.上传至hdfs
1)启动环境
2)上传数据文件与jar包
3)数据上传至集群
[root@hadoop001 hive]# hdfs dfs -put guiliVideo /
4)运行jar包
[root@hadoop001 hive]# yarn jar
/root/hive/guli.video-1.0-SNAPSHOT.jar
com.guli.mapper.VideoETLRunner /guiliVideo/video/2008/0222 /guliout
二.准备工作
1.准备hive环境
hive> SHOW DATABASES;
hive> SET hive.cli.print.current.db=true;
hive (default)> USE text;
hive (text)> set hive.cli.print.header=true;
2.创建表
1)gulivideo_ori
hive (text)> create table gulivideo_ori(
> videoId string,
> uploader string,
> age int,
> category array<string>,
> length int,
> views int,
> rate float,
> ratings int,
> comments int,
> relatedId array<string>)
> row format delimited fields terminated by "\t"
> collection items terminated by "&"
> stored as textfile;
2)gulivideo_user_ori
hive (text)> create table gulivideo_user_ori(
> uploader string,
> videos int,
> friends int)
> row format delimited fields terminated by "\t"
> stored as textfile;
3) gulivideo_orc
hive (text)> create table gulivideo_orc(
> videoId string,
> uploader string,
> age int,
> category array<string>,
> length int,
> views int,
> rate float,
> ratings int,
> comments int,
> relatedId array<string>)
> clustered by (uploader) into 8 buckets
> row format delimited fields terminated by "\t"
> collection items terminated by "&"
> stored as orc;
4)gulivideo_user_orc
hive (text)> create table gulivideo_user_orc(
> uploader string,
> videos int,
> friends int)
> row format delimited fields terminated by "\t"
> stored as orc;
3.导入ELT后的数据
hive (text)> load data inpath "/guliout" into table gulivideo_ori;
hive (text)> load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;
4.向ORC表插入数据
hive (text)> insert into table gulivideo_orc select * from gulivideo_ori;
hive (text)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;
5.查看导入的数据结果
hive (text)> select * from gulivideo_orc limit 2;
hive (text)> select * from gulivideo_user_orc limit 2;
三、业务分析
1.统计观看视频数Top10
hive (text)> select videoId,uploader,age,category,length,views,rate,ratings,comments from gulivideo_orc order by views desc limit 10;
2.统计视频类热度
hive (text)> select
> category_name as category,
> count(t1.videoId) as hot
> from (
> select
> videoId,
> category_name
> from
> gulivideo_orc lateral view explode(category) t_catetory as
> category_name) t1
> group by
> t1.category_name
> order by
> hot
> desc limit
> 10;
3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
hive (text)> select
> category_name as category,
> count(t2.videoId) as hot_with_views
> from (
> select
> videoId,
> category_name
> from (
> select
> *
> from
> gulivideo_orc
> order by
> views
> desc limit
> 20) t1 lateral view explode(category) t_catetory as
> category_name) t2
> group by
> category_name
> order by
> hot_with_views
> desc;
4.统计视频观看数Top50所关联视频的所属类别Rank
hive (text)> select
> category_name as category,
> count(t5.videoId) as hot
> from (
> select
> videoId,
> category_name
> from (
> select
> distinct(t2.videoId),
> t3.category
> from (
> select
> explode(relatedId) as videoId
> from (
> select
> *
> from
> gulivideo_orc
> order by
> views
> desc limit
> 50) t1) t2
> inner join
> gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view
> explode(category) t_catetory as category_name) t5
> group by
> category_name
> order by
> hot
> desc;
5.统计每个类别中的视频热度Top10,以Music为例
1)创建类别表
hive (text)> create table gulivideo_category(
> videoId string,
> uploader string,
> age int,
> categoryId string,
> length int,
> views int,
> rate float,
> ratings int,
> comments int,
> relatedId array<string>)
> row format delimited
> fields terminated by "\t"
> collection items terminated by "&"
> stored as orc;
2)向类别表中插入数据
hive (text)> insert into table gulivideo_category
> select
> videoId,
> uploader,
> age,
> categoryId,
> length,
> views,
> rate,
> ratings,
> comments,
> relatedId
> from
> gulivideo_orc lateral view explode(category) catetory as
> categoryId;
3)统计Music类别的Top10(也可以统计其他)
hive (text)> select
> videoId,
> views
> from
> gulivideo_category
> where
> categoryId = "Music"
> order by
> views
> desc limit
> 10;
6.统计每个类别中视频流量Top10,以Music例
hive (text)> select
> videoId,
> views,
> ratings
> from
> gulivideo_category
> where
> categoryId = "Music"
> order by
> ratings
> desc limit
> 10;
7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
hive (text)> select
> t2.videoId,
> t2.views,
> t2.ratings,
> t1.videos,
> t1.friends
> from (
> select
> *
> from
> gulivideo_user_orc
> order by
> videos desc
> limit
> 10) t1
> join
> gulivideo_orc t2
> on
> t1.uploader = t2.uploader
> order by
> views desc
> limit
> 20;
8.统计每个类别视频观看数Top10
hive (text)> select
> t1.*
> from (
> select
> videoId,
> categoryId,
> views,
> row_number() over(partition by categoryId order by views desc)
> rank from gulivideo_category) t1
> where
> rank <= 10;