当前位置: 首页 > news >正文

Hive高手之路:二、hive数据分析实验

数据定义、数据插入、数据查询实验跳过

目录

一.ELT原始数据处理

1.新建项目

2.添加项目依赖

3.文件内容 

1)数据抽取、拆分工具类函数 

2)继承Mapper

3)运行Runner 

4.打包

​编辑

 5.上传至hdfs

1)启动环境

 2)上传数据文件与jar包

3)数据上传至集群

4)运行jar包

 二.准备工作

1.准备hive环境

2.创建表

1)gulivideo_ori

2)gulivideo_user_ori 

3) gulivideo_orc 

 4)gulivideo_user_orc

3.导入ELT后的数据

4.向ORC表插入数据 

 5.查看导入的数据结果

 三、业务分析

1.统计观看视频数Top10

​编辑 2.统计视频类热度

 3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

 4.统计视频观看数Top50所关联视频的所属类别Rank

 5.统计每个类别中的视频热度Top10,以Music为例

1)创建类别表

 2)向类别表中插入数据

 3)统计Music类别的Top10(也可以统计其他)

 6.统计每个类别中视频流量Top10,以Music例

7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

 8.统计每个类别视频观看数Top10


一.ELT原始数据处理

1.新建项目

2.添加项目依赖

3.文件内容 

1)数据抽取、拆分工具类函数 

2)继承Mapper

3)运行Runner 

4.打包

 5.上传至hdfs

1)启动环境

 2)上传数据文件与jar包

3)数据上传至集群

[root@hadoop001 hive]#  hdfs dfs -put guiliVideo /

4)运行jar包

[root@hadoop001 hive]# yarn jar

 /root/hive/guli.video-1.0-SNAPSHOT.jar

com.guli.mapper.VideoETLRunner /guiliVideo/video/2008/0222 /guliout

 二.准备工作

1.准备hive环境

hive> SHOW DATABASES;

hive> SET hive.cli.print.current.db=true;

hive (default)> USE text;

hive (text)> set hive.cli.print.header=true;

2.创建表

1)gulivideo_ori

hive (text)> create table gulivideo_ori(

           >     videoId string,

           >     uploader string,

           >     age int,

           >     category array<string>,

           >     length int,

           >     views int,

           >     rate float,

           >     ratings int,

           >     comments int,

           >     relatedId array<string>)

           > row format delimited fields terminated by "\t"

           > collection items terminated by "&"

           > stored as textfile;

2)gulivideo_user_ori 

hive (text)> create table gulivideo_user_ori(

           >     uploader string,

           >     videos int,

           >     friends int)

           > row format delimited fields terminated by "\t"

           > stored as textfile;

3) gulivideo_orc 

hive (text)> create table gulivideo_orc(

           >     videoId string,

           >     uploader string,

           >     age int,

           >     category array<string>,

           >     length int,

           >     views int,

           >     rate float,

           >     ratings int,

           >     comments int,

           >     relatedId array<string>)

           > clustered by (uploader) into 8 buckets

           > row format delimited fields terminated by "\t"

           > collection items terminated by "&"

           > stored as orc;

 4)gulivideo_user_orc

hive (text)> create table gulivideo_user_orc(

           >     uploader string,

           >     videos int,

           >     friends int)

           > row format delimited fields terminated by "\t"

           > stored as orc;

3.导入ELT后的数据

hive (text)> load data inpath "/guliout" into table gulivideo_ori;

hive (text)> load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;

4.向ORC表插入数据 

hive (text)> insert into table gulivideo_orc select * from gulivideo_ori;

hive (text)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;

 5.查看导入的数据结果

hive (text)> select * from gulivideo_orc limit 2;

hive (text)> select * from gulivideo_user_orc limit 2;

 三、业务分析

1.统计观看视频数Top10

hive (text)> select videoId,uploader,age,category,length,views,rate,ratings,comments from gulivideo_orc order by views desc limit 10;

 2.统计视频类热度

hive (text)> select

           >     category_name as category,

           >     count(t1.videoId) as hot

           > from (

           >     select

           >         videoId,

           >         category_name

           >     from

           >         gulivideo_orc lateral view explode(category) t_catetory as

           > category_name) t1

           > group by

           >     t1.category_name

           > order by

           >     hot

           > desc limit

           >     10;

 3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

hive (text)> select

           >     category_name as category,

           >     count(t2.videoId) as hot_with_views

           > from (

           >     select

           >         videoId,

           >         category_name

           >     from (

           >         select

           >             *

           >         from

           >             gulivideo_orc

           >         order by

           >             views

           >             desc limit

           >                 20) t1 lateral view explode(category) t_catetory as

           > category_name) t2

           > group by

           >     category_name

           > order by

           >     hot_with_views

           > desc;

 4.统计视频观看数Top50所关联视频的所属类别Rank

hive (text)> select

           >     category_name as category,

           >     count(t5.videoId) as hot

           > from (

           >     select

           >         videoId,

           >         category_name

           >     from (

           >         select

           >         distinct(t2.videoId),

           >         t3.category

           >         from (

           >             select

           >                 explode(relatedId) as videoId

           >             from (

           >                 select

           >                     *

           >                 from

           >                     gulivideo_orc

           >                 order by

           >                     views

           >                 desc limit

           >                     50) t1) t2

           >         inner join

           >             gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view

           > explode(category) t_catetory as category_name) t5

           > group by

           >     category_name

           > order by

           >     hot

           > desc;

 5.统计每个类别中的视频热度Top10,以Music为例

1)创建类别表

hive (text)> create table gulivideo_category(

           >     videoId string,

           >     uploader string,

           >     age int,

           >     categoryId string,

           >     length int,

           >     views int,

           >     rate float,

           >     ratings int,

           >     comments int,

           >     relatedId array<string>)

           > row format delimited

           > fields terminated by "\t"

           > collection items terminated by "&"

           > stored as orc;

 2)向类别表中插入数据

hive (text)> insert into table gulivideo_category

           >     select

           >         videoId,

           >         uploader,

           >         age,

           >         categoryId,

           >         length,

           >         views,

           >         rate,

           >         ratings,

           >         comments,

           >         relatedId

           >     from

           >         gulivideo_orc lateral view explode(category) catetory as

           > categoryId;

 3)统计Music类别的Top10(也可以统计其他)

hive (text)> select

           >     videoId,

           >     views

           > from

           >     gulivideo_category

           > where

           >     categoryId = "Music"

           > order by

           >     views

           > desc limit

           >     10;

 6.统计每个类别中视频流量Top10,以Music例

hive (text)> select

           >     videoId,

           >     views,

           >     ratings

           > from

           >     gulivideo_category

           > where

           >     categoryId = "Music"

           > order by

           >     ratings

           > desc limit

           >     10;

7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

hive (text)> select

           >     t2.videoId,

           >     t2.views,

           >     t2.ratings,

           >     t1.videos,

           >     t1.friends

           > from (

           >     select

           >         *

           >     from

           >         gulivideo_user_orc

           >     order by

           >         videos desc

           >     limit

           >         10) t1

           >     join

           >         gulivideo_orc t2

           >     on

           >         t1.uploader = t2.uploader

           >     order by

           >         views desc

           >     limit

           >         20;

 8.统计每个类别视频观看数Top10

hive (text)> select

           >     t1.*

           > from (

           >     select

           >         videoId,

           >         categoryId,

           >         views,

           > row_number() over(partition by categoryId order by views desc)

           > rank from gulivideo_category) t1

           > where

           >     rank <= 10;

相关文章:

  • 百度网站名称和网址/兰州网络推广关键词优化
  • 网站建设联系我们/my63777免费域名查询2023年
  • 医疗企业网站模板免费下载/服装品牌策划方案
  • 惠山区住房和建设厅网站/山东服务好的seo公司
  • 山东德州网站建设哪家最专业/市场调研方案范文
  • 古典网站案例/进行网络推广
  • 猿创征文|破世界纪录的国产数据库 - OceanBase
  • 【Day28】力扣算法(超详细思路+注释) [1790. 仅执行一次字符串交换能否使两个字符串相等 ] [328. 奇偶链表 ][148. 排序链表]
  • new 和 delete 为什么要匹配使用
  • M的编程备忘录之C++——C++11
  • Shell 脚本
  • 业务数据采集
  • 基于springboot求职招聘网站
  • 【java_wxid项目】【第十五章】【Spring Cloud Alibaba Sentinel集成】
  • Sharding-Jdbc实战之一:概述
  • C# 连接 MySQL 数据库
  • c语言分层理解(c语言文件操作)
  • python 矩阵运算