Apache Drill is an engine that can connect to many different data sources and provide a SQL interface to them. It's not just a SQL interface over anything vaguely queryable either, but a powerful one, including support for many built-in functions as well as windowing functions. Although it can connect to standard data sources that you'd be able to query with SQL anyway, such as Oracle or MySQL, it can also work with flat files such as CSV or JSON, as well as Avro and Parquet formats. It's this ability to run SQL against files that first piqued my interest in Apache Drill. I've been spending a lot of time looking at big data architectures and tooling, including Big Data Discovery. As part of this, and of experimenting with data pipeline options, one of the gaps I found was the ability to dig into files in their raw state, before they've been brought into something like Hive, which would then make them available through BDD and other tools. In this article I'll walk through getting started with Apache Drill, and show some of the types of queries that I think are a great example of how useful it can be.

Getting started

Getting going with Apache Drill is really simple — just download, unpack, and run. While it can be run distributed across machines for performance, it can also run standalone on a laptop. To launch it:

cd /opt/apache-drill-1.7.0/
bin/sqlline -u jdbc:drill:zk=local

If you get No current connection or com.fasterxml.jackson.databind.JavaType.isReferenceType()Z then you've got a conflicting JAR problem (I hit this on the Oracle BigDataLite VM, for example), and should launch it with a clean environment:

env -i HOME=$HOME LC_CTYPE=${LC_ALL:-${LC_CTYPE:-$LANG}} PATH=$PATH USER=$USER /opt/apache-drill-1.7.0/bin/drill-embedded

There's a built-in dataset that you can use for testing:

USE cp;
SELECT employee_id, first_name FROM `employee.json` limit 5;

If you're used to SQL*Plus and similar tools, this should be very familiar territory — it returns five rows:

0: jdbc:drill:zk=local> USE cp;
----------------------------------------
| ok | summary |
----------------------------------------
| true | Default schema changed to [cp] |
----------------------------------------
1 row selected (1.776 seconds)
0: jdbc:drill:zk=local> SELECT employee_id, first_name FROM `employee.json` limit 5;
---------------------------
| employee_id | first_name |
---------------------------
| 1 | Sheri |
| 2 | Derrick |
| 4 | Michael |
| 5 | Maya |
| 6 | Roberta |
---------------------------
5 rows selected (3.624 seconds)

So far, so SQL; so relational; so familiar, really. Where Apache Drill starts to deviate from the obvious is in its use of storage handlers. In the query above, cp is the 'database' that we're running the query against — except it's actually the 'classpath' (hence 'cp') storage handler that's defined by default. Within a 'database' there are 'schemas', which are sub-configurations of the storage handler. We'll look at viewing and defining these later on. For now, it's useful to know that you can also list the available databases:

0: jdbc:drill:zk=local> show databases;
---------------------
| SCHEMA_NAME |
---------------------
| INFORMATION_SCHEMA |
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| sys |
---------------------

Note that the databases command is a synonym for schemas; it's the database.schema that both return. In Apache Drill, backticks are used to enclose identifiers (such as schema names and column names), and it's quite particular about them. For example, this is valid:

0: jdbc:drill:zk=local> USE `cp.default`;
------------------------------------------------
| ok | summary |
------------------------------------------------
| true | Default schema changed to [cp.default] |
------------------------------------------------
1 row selected (0.171 seconds)

whilst this isn't:

0: jdbc:drill:zk=local> USE cp.default;
Error: PARSE ERROR: Encountered ". default" at line 1, column 7.
Was expecting one of:
    <EOF>
    "." <IDENTIFIER> ...
    "." <QUOTED_IDENTIFIER> ...
    "." <BACK_QUOTED_IDENTIFIER> ...
    "." <BRACKET_QUOTED_IDENTIFIER> ...
    "." <UNICODE_QUOTED_IDENTIFIER> ...
    "." "*" ...

SQL Query USE cp.default

This is because default is a reserved word, and thus has to be quoted. Hence you can also use:

0: jdbc:drill:zk=local> use cp.`default`;

but not:

0: jdbc:drill:zk=local> use `cp`.default;

Querying JSON data

On the Apache Drill website there are some useful tutorials, including one using data provided by Yelp. This was the dataset that originally led me to look at Drill, since I was using it as an input into Big Data Discovery (BDD) and struggling on two counts. The first was how best to define a suitable Hive table over it so that it could be ingested into BDD. The second was understanding what value there might be in the data, which would determine how much time I spent refining the way I exposed the data in Hive. The examples below show the kind of complications that complex JSON can bring when queried in tabular form. First up, querying a JSON file, with the schema inferred automatically. Pretty cool.

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` limit 5;
-----------------------------------------------
| user_id | text | business_id | likes | date | type |
-----------------------------------------------
| -6rEfobYjMxpUWLNxszaxQ | Dont waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip |
| EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip |
| xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip |
| QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip |
| MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip |
-----------------------------------------------
5 rows selected (2.341 seconds)

We can use standard SQL aggregations such as COUNT:

0: jdbc:drill:zk=local> select count(*) from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json`;
---------
| EXPR$0 |
---------
| 591864 |
---------
1 row selected (4.495 seconds)

as well as GROUP BY operations:

0: jdbc:drill:zk=local> select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
-------------------------
| date | tip_count |
-------------------------
| 2012-07-21 | 719 |
| 2012-05-19 | 718 |
| 2012-08-04 | 699 |
| 2012-06-23 | 690 |
| 2012-07-28 | 682 |
-------------------------
5 rows selected (7.111 seconds)

Digging into the data a bit, we can see that it's not entirely flat — note, for example, the hours column, which is a nested JSON object:

0: jdbc:drill:zk=local> select full_address,city,hours from `/user/oracle/incoming/yelp/business_json` b limit 5;
---------------------------
| full_address | city | hours |
---------------------------
| 4734 Lebanon Church Rd
Dravosburg, PA 15034 | Dravosburg | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
| 202 McClure St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1 Ravine St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1530 Hamilton Rd
Bethel Park, PA 15234 | Bethel Park | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 301 South Hills Village
Pittsburgh, PA 15241 | Pittsburgh | {"Friday":{"close":"17:00","open":"10:00"},"Tuesday":{"close":"21:00","open":"10:00"},"Thursday":{"close":"17:00","open":"10:00"},"Wednesday":{"close":"21:00","open":"10:00"},"Monday":{"close":"21:00","open":"10:00"},"Sunday":{"close":"18:00","open":"11:00"},"Saturday":{"close":"21:00","open":"10:00"}} |
---------------------------
5 rows selected (0.721 seconds)
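The nesting shown in the hours column is just a plain JSON document. As a point of reference, here's how the same navigation looks in plain Python — the record below is a hand-made sample shaped like the output above, not a row taken from the real dataset:

```python
import json

# Hand-made sample shaped like the Yelp business records above --
# not a row from the real dataset.
record = json.loads("""
{
  "name": "Mr Hoagie",
  "city": "Dravosburg",
  "hours": {
    "Friday":   {"open": "11:00", "close": "21:00"},
    "Saturday": {}
  }
}
""")

# The same walk that Drill's dot notation performs:
print(record["hours"]["Friday"]["open"])

# Days with no hours are empty objects, so guard for missing keys:
print(record["hours"]["Saturday"].get("open", "closed"))
```

Drill's dot notation, covered next, is doing exactly this kind of key-by-key descent, only expressed in SQL.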
0: jdbc:drill:zk=local>

With Apache Drill we can simply use dot notation to access nested values. When doing this, the table has to be aliased (b, in this example):

0: jdbc:drill:zk=local> select b.hours from `/user/oracle/incoming/yelp/business_json` b limit 1;
-------
| hours |
-------
| {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
-------

Nested objects can themselves be nested — not a problem for Apache Drill, we just chain the dot notation further:

0: jdbc:drill:zk=local> select b.hours.Friday from `/user/oracle/incoming/yelp/business_json` b limit 1;
-----------------------------------
| EXPR$0 |
-----------------------------------
| {"close":"21:00","open":"11:00"} |
-----------------------------------
1 row selected (0.238 seconds)

Note the use of backticks to quote the reserved open and close keywords:

0: jdbc:drill:zk=local> select b.hours.Friday.`open`,b.hours.Friday.`close` from `/user/oracle/incoming/yelp/business_json` b limit 1;
------------------
| EXPR$0 | EXPR$1 |
------------------
| 11:00 | 21:00 |
------------------
1 row selected (0.58 seconds)

Nested columns are proper objects in their own right in the query, and can also be used as predicates:

0: jdbc:drill:zk=local> select b.name,b.full_address,b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
---------------------------------------------------------------------------------
| name | full_address | EXPR$2 |
---------------------------------------------------------------------------------
| Mr Hoagie | 4734 Lebanon Church Rd
Dravosburg, PA 15034 | 11:00 |
| Alexions Bar Grill | 141 Hawthorne St
Greentree
Carnegie, PA 15106 | 11:00 |
| Rockys Lounge | 1201 Washington Ave
Carnegie, PA 15106 | 11:00 |
| Papa Js | 200 E Main St
Carnegie
Carnegie, PA 15106 | 11:00 |
| Italian Village Pizza | 2615 Main St
Homestead, PA 15120 | 11:00 |
---------------------------------------------------------------------------------
5 rows selected (0.404 seconds)

You'll notice in the output above that the full_address field has line breaks in it — we can use a SQL function to replace the line breaks with commas:

0: jdbc:drill:zk=local> select b.name,regexp_replace(b.full_address,'\n',','),b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
---------------------------------------------------------------------------------
| name | EXPR$1 | EXPR$2 |
---------------------------------------------------------------------------------
| Mr Hoagie | 4734 Lebanon Church Rd,Dravosburg, PA 15034 | 11:00 |
| Alexions Bar Grill | 141 Hawthorne St,Greentree,Carnegie, PA 15106 | 11:00 |
| Rockys Lounge | 1201 Washington Ave,Carnegie, PA 15106 | 11:00 |
| Papa Js | 200 E Main St,Carnegie,Carnegie, PA 15106 | 11:00 |
| Italian Village Pizza | 2615 Main St,Homestead, PA 15120 | 11:00 |
---------------------------------------------------------------------------------
5 rows selected (1.346 seconds)

Query federation

So Apache Drill gives you the ability to run SQL queries against data in a variety of formats and locations, which is pretty useful in itself. But better than that, it lets you federate those sources in a single query. Here's an example of a join between data in HDFS and data in Oracle:

0: jdbc:drill:zk=local> select X.text,
. . . . . . . . . . . Y.NAME
. . . . . . . . . . . from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X
. . . . . . . . . . . inner join ora.MOVIEDEMO.YELP_BUSINESS Y
. . . . . . . . . . . on X.business_id = Y.BUSINESS_ID
. . . . . . . . . . . where Y.NAME = 'Chick-fil-A'
. . . . . . . . . . . limit 5;
----------------------------------------------------------------------------------
| text | NAME |
----------------------------------------------------------------------------------
| Its daddy daughter date night here and they go ALL OUT! | Chick-fil-A |
| Chicken minis! The best part of waking up Saturday mornings. :) | Chick-fil-A |
| Nice folks as always unlike those ghetto joints | Chick-fil-A |
| Great clean and delicious chicken sandwiches! | Chick-fil-A |
| Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW! | Chick-fil-A |
----------------------------------------------------------------------------------
5 rows selected (3.234 seconds)

You can define a view over this:

0: jdbc:drill:zk=local> create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID;
--------------------------------------------------------------------
| ok | summary |
--------------------------------------------------------------------
| true | View yelp_tips replaced successfully in dfs.tmp schema |
--------------------------------------------------------------------
1 row selected (0.574 seconds)
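A view like this also makes Drill handy as a lightweight SQL endpoint for scripts. As a sketch — assuming a local Drill with its web UI on the default port 8047, and the dfs.tmp.yelp_tips view defined above — Drill's REST API can be called with nothing but the standard library:

```python
import json
from urllib import request

# Assumed location of a local Drill's REST endpoint (default web port 8047).
DRILL_URL = "http://localhost:8047/query.json"

def drill_payload(sql: str) -> bytes:
    # Request body in the shape Drill's REST API expects.
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

def drill_query(sql: str):
    # POST the query and return the rows from the JSON response.
    req = request.Request(
        DRILL_URL,
        data=drill_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["rows"]

if __name__ == "__main__":
    for row in drill_query("select tip_text, business_name from dfs.tmp.yelp_tips limit 5"):
        print(row)
```

The same SQL that works in sqlline works here, federated join and all.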
0: jdbc:drill:zk=local> describe dfs.tmp.yelp_tips;
--------------------------------------------------
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
--------------------------------------------------
| tip_text | ANY | YES |
| business_name | CHARACTER VARYING | YES |
--------------------------------------------------
2 rows selected (0.756 seconds)

and then query it as you would any regular object:

0: jdbc:drill:zk=local> select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5;
------------
| tip_text | business_name |
------------
| Great drink specials! | Alexions Bar Grill |
| Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexions Bar Grill |
| Pretty quiet here... | Uno Pizzeria Grill |
| I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat) the broccoli cheddar soup is delicious. | Uno Pizzeria Grill |
| Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria Grill |
------------
5 rows selected (3.272 seconds)

Querying Twitter JSON data

Here's an example of using Drill to query a local file holding some Twitter data. You can download the file here if you'd like to try the querying out for yourself. First off, I switch to using the dfs storage plugin:

0: jdbc:drill:zk=local> use dfs;
-----------------------------------------
| ok | summary |
-----------------------------------------
| true | Default schema changed to [dfs] |
-----------------------------------------

And then try a select against the file. Note the limit 5 clause — very useful when you're just examining the structure of a file.

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/twitter/geo_tweets.json` limit 5;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entries

File  /user/oracle/incoming/twitter/geo_tweets.json
Record 2819
Column 3503
Fragment 0:0

An error? That's not supposed to happen! I've got a JSON file, right? It turns out the JSON file has one complete JSON object per line — except that it doesn't, on the very last record. Note the record count given in the error above, 2819:

[oraclebigdatalite ~]$ wc -l geo_tweets.json
2818 geo_tweets.json

So the file only has 2818 complete lines. Hmmm. Let's take a look at that record, using a head/tail combo:

[oraclebigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1
{"created_at":"Sun Jul 24 21:00:44 +0000 2016","id":757319630432067584,"id_str":"757319630432067584","text":"And now HillaryClinton hires DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica https://t.co/8jAGUu6w2f","source":"TweetCaster for iOS","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2170786369,"id_str":"2170786369","name":"Patricia Weber","screen_name":"InnieBabyBoomer","location":"Williamsburg, VA","url":"http://lovesrantsandraves.blogspot.com/","description":"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, ♥ Books, Cars, Ferrari, F1 Race♥ #tcot","protected":false,"verified":false,"followers_count":861,"friends_count":918,"listed_count":22,"favourites_count":17,"statuses_count":2363,"created_at":"Sat Nov 02 19:13:06 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000107659131/3589f

That's the complete data in the file — Drill is right — the JSON is broken. If we remove that last record and create a new file (geo_tweets.fixed.json), then query it again, we get something:

0: jdbc:drill:zk=local> select text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
------
| text |
------
| Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. |
| Obama: We must stand together and stop terrorism
Trump: We dont want these people in our country|
| Someone built a wall around Trumps star on the Hollywood Walk of Fame. #lol #nowthatsfunny … https://t.co/qHWuJXnzbw |
------
5 rows selected (0.246 seconds)

The text here is one of the JSON fields. I could do a select *, but it's not so intelligible:

0: jdbc:drill:zk=local> select * from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | version | timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Fri Jul 22 19:37:11 0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | dlvr.it | false | {id:67898674,id_str:67898674,name:Vancouver Press,screen_name:Vancouver_CP,location:Vancouver, BC,url:http://vancouver.cityandpress.com/,description:Latest news from Vancouver. Updates are frequent.,protected:false,verified:false,followers_count:807,friends_count:13,listed_count:94,favourites_count:1,statuses_count:131010,created_at:Sat Aug 22 14:25:37 0000 2009,utc_offset:-25200,time_zone:Pacific Time (US Canada),geo_enabled:true,lang:en,contributors_enabled:false,is_translator:false,profile_background_color:FFFFFF,profile_background_image_url:http://abs.twimg.com/images/themes/theme1/bg.png,profile_background_image_url_https:https://abs.twimg.com/images/themes/theme1/bg.png,profile_background_tile:false,profile_link_color:8A1C3B,profile_sidebar_border_color:FFFFFF,profile_sidebar_fill_color:FFFFFF,profile_text_color:2A2C31,profile_use_background_image:false,profile_image_url:http://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png,profile_image_url_https:https://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png,profile_banner_url:https://pbs.twimg.com/profile_banners/67898674/1411821103,default_profile:false,default_profile_image:false} | {type:Point,coordinates:[49.2814375,-123.12109067]} | {type:Point,coordinates:[-123.12109067,49.2814375]} | {id:1e5cb4d0509db554,url:https://api.twitter.com/1.1/geo/id/1e5cb4d0509db554.json,place_type:city,name:Vancouver,full_name:Vancouver, British Columbia,country_code:CA,country:Canada,bounding_box:{type:Polygon,coordinates:[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},attributes:{}} | false | 0 | 0 | 
{urls:[{url:https://t.co/joI9GMfRim,expanded_url:http://toplocalnow.com/ca/vancouver?sectiontrends,display_url:toplocalnow.com/ca/vancouver?s…,indices:[70,93]}],hashtags:[],user_mentions:[],media:[],symbols:[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 0000 2009 | Vancouver | Canada | | [toplocalnow.com/ca/vancouver?s…] | toplocalnow.com/ca/vancouver?s… | | -123.12109067 | 49.2814375 | [] | {media:[]} | [] | null | null | null | null | null | {user:{},entities:{user_mentions:[],media:[],hashtags:[],urls:[]},extended_entities:{media:[]},quoted_status:{user:{},entities:{hashtags:[],user_mentions:[],media:[],urls:[]},extended_entities:{media:[]}}} | null | null | null | {user:{},entities:{user_mentions:[],media:[],urls:[],hashtags:[]},extended_entities:{media:[]},place:{bounding_box:{coordinates:[]},attributes:{}},geo:{coordinates:[]},coordinates:{coordinates:[]}} |

In the Twitter data there are fields at the root level, such as text, as well as nested ones, such as user (which holds information about the tweeter). As seen above, you reference nested fields using dot notation. Now is a good time to point out a couple of common errors that you may hit. The first is not quoting reserved words, and it's the first thing to check for if you get an error such as Encountered ".":

0: jdbc:drill:zk=local> select user.screen_name,text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Error: PARSE ERROR: Encountered "." at line 1, column 12.
[...]

The second is declaring the table alias when using dot notation — if you don't, Apache Drill thinks that the parent column is actually the table name (VALIDATION ERROR: [...] Table 'user' not found):

0: jdbc:drill:zk=local> select `user`.screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Aug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found
Aug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found
Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not found

SQL Query null

[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)

With those errors fixed, we can see the user's screen name:

0: jdbc:drill:zk=local> select tweets.`user`.screen_name as user_screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 2;
------------------------
| user_screen_name | text |
------------------------
| Vancouver_CP | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
------------------------
2 rows selected (0.256 seconds)
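Incidentally, the truncated-record problem that geo_tweets.json suffered from is easy to screen for before Drill ever reads a file: Drill treats these files as newline-delimited JSON, one object per line, so a line-by-line parse pinpoints any bad record. A minimal sketch (the file path is whatever you intend to query):

```python
import json

def find_bad_records(path):
    """Return (line_number, error) pairs for lines that fail to parse as JSON."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                bad.append((lineno, str(err)))
    return bad

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        for lineno, err in find_bad_records(path):
            print(f"{path}: line {lineno}: {err}")
```

Run against the original file, this would have flagged record 2819 straight away, before any DATA_READ ERROR from Drill.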
0: jdbc:drill:zk=local>

As well as nested objects, JSON also supports arrays. An example of these in the Twitter data is hashtags, or URLs, of which any given tweet can have zero, one, or many:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
--------
| EXPR$0 |
--------
| [] |
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [] |
| [] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
--------
5 rows selected (0.286 seconds)

Using the FLATTEN function, each array entry becomes a new row, thus:

0: jdbc:drill:zk=local> select flatten(tweets.entities.hashtags) from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
----------------------------------------------
| EXPR$0 |
----------------------------------------------
| {"text":"hiring","indices":[6,13]} |
| {"text":"Job","indices":[98,102]} |
| {"text":"SkilledTrade","indices":[103,116]} |
| {"text":"Tucson","indices":[117,124]} |
| {"text":"Jobs","indices":[129,134]} |
----------------------------------------------
5 rows selected (0.139 seconds)

Note that the limit 5 clause is only showing the first five array instances, which are actually just the hashtags from the first tweet in the list above. To access the text of each hashtag we use a subquery and dot notation to get at the text field:

0: jdbc:drill:zk=local> select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5;
---------------
| EXPR$0 |
---------------
| hiring |
| Job |
| SkilledTrade |
| Tucson |
| Jobs |
---------------
5 rows selected (0.168 seconds)

The same result can be achieved, with better readability, by using a Common Table Expression (CTE, also known as subquery factoring):

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . select ent_hashtags.hashtags.text from ent_hashtags
. . . . . . . . . . . limit 5;
---------------
| EXPR$0 |
---------------
| hiring |
| Job |
| SkilledTrade |
| Tucson |
| Jobs |
---------------
5 rows selected (0.253 seconds)

Combining the flattened array with existing fields enables us to see things like a list of tweets with their associated hashtags:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10;
---------------------------------
| user_screen_name | text | hashtag |
---------------------------------
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring |
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade |
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson |
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs |
| johnmayberry | Someone built a wall around Trumps star on the Hollywood Walk of Fame. #lol #nowthatsfunny … https://t.co/qHWuJXnzbw | lol |
| johnmayberry | Someone built a wall around Trumps star on the Hollywood Walk of Fame. #lol #nowthatsfunny … https://t.co/qHWuJXnzbw | nowthatsfunny |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trumps visit to… https://t.co/6OVl7crshw #ws winston_salem_ https://t.co/l5h220otj4 | WinstonSalem |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trumps visit to… https://t.co/6OVl7crshw #ws winston_salem_ https://t.co/l5h220otj4 | ws |
| trendinaliaSG | 6. Hit The Stage
7. TTTT
8. Demi Lovato
9. Beijing
10. Donald Trump2016/7/23 03:36 SGT #trndnl https://t.co/psP0GzBgZB | trndnl |
---------------------------------
10 rows selected (0.166 seconds)

We can also filter based on the hashtag:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5;
---------------------------------
| user_screen_name | text | hashtag |
---------------------------------
| tmj_TUC_skltrd | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_VAL_health | Want to work at Genesis Rehab Services? Were #hiring in #Clinton, MD! Click for details: https://t.co/4lt7I4gMZk #Job #Healthcare #Jobs | Job |
| tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https://t.co/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job |
| tmj_la_hrta | Want to work at SONIC Drive-In? Were #hiring in #Clinton, LA! Click for details: https://t.co/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job |
| tmj_ia_hrta | Were #hiring! Click to apply: Department Manager - https://t.co/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job |
---------------------------------
5 rows selected (0.207 seconds)

as well as summarise hashtag counts:

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags
. . . . . . . . . . . group by ent_hashtags.hashtags.text
. . . . . . . . . . . order by 2 desc;
--------------------------------------
| EXPR$0 | EXPR$1 |
--------------------------------------
| Trump | 365 |
| trndnl | 176 |
| job | 170 |
| Hiring | 127 |
| Clinton | 108 |
| Yorkshire | 100 |
| CareerArc | 100 |
[...]

To filter out records that may not have the array value (hashtags, for example, which not every tweet has — and without this the query would otherwise fail), use IS NOT NULL against an attribute of the first index of the array:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
--------
| EXPR$0 |
--------
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
| [{"text":"WinstonSalem","indices":[0,13]},{"text":"ws","indices":[92,95]}] |
| [{"text":"trndnl","indices":[89,96]}] |
| [{"text":"trndnl","indices":[92,99]}] |
--------
5 rows selected (0.187 seconds)

If you try to compare the array itself, it doesn't work:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5;

Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema.  Errors:
Error in expression at index -1.  Error: Missing function implementation: [isnotnull(MAP-REPEATED)].  Full expression: --UNKNOWN EXPRESSION--..

Fragment 0:0

[Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0)

The example above demonstrates the use of array indexing, which is an alternative to FLATTEN for accessing individual objects in an array, if you know they're going to exist:

0: jdbc:drill:zk=local> select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
---------------------
| first_hashtag | text |
---------------------
| hiring | Were #hiring! Click to apply: Bench Jeweler - SEC Oracle Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| lol | Someone built a wall around Trumps star on the Hollywood Walk of Fame. #lol #nowthatsfunny … https://t.co/qHWuJXnzbw |
| WinstonSalem | #WinstonSalem Time and place announced for Donald Trumps visit to… https://t.co/6OVl7crshw #ws winston_salem_ https://t.co/l5h220otj4 |

Querying CSV files

JSON files are relatively easy to interpret because they carry a semi-defined schema within them, including the column names. CSV (and character-separated files in general), on the other hand, are much more of a 'wild west' when it comes to reliably inferring column names. You can configure Apache Drill to ignore the first line of a CSV file (on the assumption that it's a header) if you want to, or to take the first line as the column names. If you don't, then querying a CSV file that looks like this:

[oraclebigdatalite ~]$ head nyc_parking_violations.csv
Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,

gives you records that are each just one column wide, where that column is an array of the values:

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` LIMIT 5;
---------
| columns |
---------
| [Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation] |
| [1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,] |

To access the actual columns in the CSV file you need to reference them with the columns[x] syntax. Note that columns is case-sensitive, and the numbering is zero-based:

0: jdbc:drill:zk=local> select columns[1] as PlateID, columns[2] as RegistrationState from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` limit 5;
------------------------------
| PlateID | RegistrationState |
------------------------------
| AR877A | NJ |
| 73268ME | NY |
| 2050240 | IN |
| 2250017 | IN |
| AH524C | NJ |
------------------------------
5 rows selected (0.247 seconds)

To make it easier to work with the data repeatedly, you can define a view over it:

0: jdbc:drill:zk=local> create view dfs.tmp.NYC_Parking_01 as select columns[1] as PlateID, columns[2] as RegistrationState from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
------------------------------------------------------------------------
| ok | summary |
------------------------------------------------------------------------
| true | View NYC_Parking_01 created successfully in dfs.tmp schema |
------------------------------------------------------------------------
1 row selected (0.304 seconds)

This uses the dfs storage plugin and the tmp schema within it, which has the following storage configuration – note that writable is true:

"tmp": {
  "location": "/tmp",
  "writable": true,
  "defaultInputFormat": null
}

If you use the wrong database [storage plugin] or schema you'll get Schema [hdfs] is immutable. Querying the new view:

0: jdbc:drill:zk=local> select * from dfs.tmp.NYC_Parking_01 limit 5;
--------------------------------
| PlateID | RegistrationState |
--------------------------------
| Plate ID | Registration State |
| PHW9801 | OH |
| K8010F | TN |
| GFG6211 | NY |
| GHL1805 | NY |
--------------------------------
5 rows selected (0.191 seconds)

Through the view, or directly against the CSV path, you can also run aggregations:

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
-------------------
| PlateID | EXPR$1 |
-------------------
| 2050240 | 4 |
-------------------
1 row selected (15.983 seconds)

Although it doesn't return the same result when rerun – probably as a consequence of the limit clause:

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
-------------------
| PlateID | EXPR$1 |
-------------------
| AR877A | 3 |
-------------------
1 row selected (12.881 seconds)

Behind the scenes the view definition is written to /tmp – you'll want to relocate this path if you want the view to survive a restart:

[oracle@bigdatalite parking]$ cat /tmp/NYC_Parking_01.view.drill
{
  "name" : "NYC_Parking_01",
  "sql" : "SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\nFROM `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`",
  "fields" : [ {
    "name" : "PlateID",
    "type" : "ANY",
    "isNullable" : true
  }, {
    "name" : "RegistrationState",
    "type" : "ANY",
    "isNullable" : true
  } ],
  "workspaceSchemaPath" : [ "hdfs" ]
}

You can also create an actual table with CTAS (Create Table As Select):

0: jdbc:drill:zk=local> create table dfs.tmp.parking as select columns[1] as PlateID, columns[2] as RegistrationState from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
---------------------------------------
| Fragment | Number of records written |
---------------------------------------
| 1_1 | 4471875 |
| 1_0 | 4788421 |
---------------------------------------
2 rows selected (42.913 seconds)

This is stored on disk (per the dfs configuration), by default in Parquet format:

[oracle@bigdatalite parking]$ ls -l /tmp/parking/
total 76508
-rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet
-rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquet

Drill's Web Interface

Drill comes with a web interface, which you can access at http://<drill host>:8047/ and which is useful for:

issuing queries
configuring additional storage plugins (e.g. databases, hdfs, etc)
metrics and debugging

Defining a Storage Plugin

In the Drill web interface you can view existing storage plugins, or define new ones. To create a new one, enter its name (e.g. hdfs – though it could just as well be fred; it's simply a label) under New Storage Plugin on the Storage page and click Create. Paste the necessary JSON definition into the Configuration box, then click Create. If you'd rather not use the GUI, there's also a REST API.

Storage plugin configuration is stored in ZooKeeper (when running Drill distributed), or locally in the sys.store.provider.local.path path when running standalone. By default this is under /tmp, which gets cleared on server restart. To persist custom storage configurations, amend sys.store.provider.local.path in drill-override.conf, for example:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  sys.store.provider.local.path: "/home/oracle/drill/"
}

Working with Filesystem Data

Here's an example of a storage configuration that enables Drill to access the HDFS of a CDH cluster:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://cdh57-01-node-01:8020/",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    }
  }
}

As well as the connection parameter itself for HDFS, the important part of this configuration is the formats section. This tells Drill how to handle files that it finds, without the end-user having to explicitly declare their type.

With a filesystem-based plugin (dfs – which can cover local files, HDFS, and even Amazon S3), you can browse the available "tables". List the files in HDFS (having first selected it with use hdfs;):

0: jdbc:drill:zk=local> show files;
-----------------------------------------------------------------------------------------------------------------------------
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
-----------------------------------------------------------------------------------------------------------------------------
| hbase | true | false | 0 | hbase | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-25 14:46:08.212 |
| share | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-05-15 12:28:08.152 |
| solr | true | false | 0 | solr | solr | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:34:50.716 |
| tmp | true | false | 0 | hdfs | supergroup | rwxrwxrwt | 1969-12-31 19:00:00.0 | 2016-06-24 04:54:41.491 |
| user | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-06-21 15:55:59.084 |
| var | true | false | 0 | hdfs | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-11 17:53:29.804 |
-----------------------------------------------------------------------------------------------------------------------------
6 rows selected (0.145 seconds)

Show the files in a given path:

0: jdbc:drill:zk=local> show files in `/user/oracle`;
----------------------------------------------------------------------------------------------------------------------------------
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
----------------------------------------------------------------------------------------------------------------------------------
| .Trash | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-23 20:42:34.815 |
| .sparkStaging | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-06 03:56:38.863 |
| .staging | true | false | 0 | oracle | oracle | rwx------ | 1969-12-31 19:00:00.0 | 2016-06-01 18:37:04.005 |
| incoming | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-08-03 05:34:12.38 |
| mediademo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:59:45.653 |
| moviedemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:02:55.652 |
| moviework | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.497 |
| oggdemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.552 |
| oozie-oozi | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.651 |
----------------------------------------------------------------------------------------------------------------------------------
9 rows selected (0.428 seconds)

You can also query across multiple files by specifying a wildcard match. Here's the truncated list of files available:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> show files in hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/`;
---------------------------------------------------------------------------------------------------------------------------------------------
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
---------------------------------------------------------------------------------------------------------------------------------------------
| FlumeData.1466176113171 | false | true | 1055675 | rmoff | rmoff | rw-r--r-- | 2016-08-10 21:28:27.072 | 2016-06-17 16:08:38.023 |
| FlumeData.1466176113172 | false | true | 1051411 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:40.597 |
| FlumeData.1466176113173 | false | true | 1054734 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:43.33 |
| FlumeData.1466176113174 | false | true | 1050991 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.743 | 2016-06-17 16:08:44.361 |
| FlumeData.1466176113175 | false | true | 1053577 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.748 | 2016-06-17 16:08:45.162 |
| FlumeData.1466176113176 | false | true | 1051965 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:46.261 |
| FlumeData.1466176113177 | false | true | 1049555 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:47.425 |
| FlumeData.1466176113178 | false | true | 1050566 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:48.23 |
| FlumeData.1466176113179 | false | true | 1051751 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:49.381 |
| FlumeData.1466176113180 | false | true | 1052249 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:50.042 |
| FlumeData.1466176113181 | false | true | 1055002 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:50.896 |
| FlumeData.1466176113182 | false | true | 1050812 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:52.191 |
| FlumeData.1466176113183 | false | true | 1048954 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:52.994 |
| FlumeData.1466176113184 | false | true | 1051559 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.773 | 2016-06-17 16:08:54.025 |
[...]

Count the number of records in one file (FlumeData.1466176113171):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json'));
---------
| EXPR$0 |
---------
| 277 |
---------
1 row selected (0.798 seconds)

In several files (FlumeData.146617611317*):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.146617611317*`(type => 'json'));
---------
| EXPR$0 |
---------
| 2415 |
---------
1 row selected (2.466 seconds)

In all the files in a folder (*):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/*`(type => 'json'));
---------
| EXPR$0 |
---------
| 7414 |
---------
1 row selected (3.867 seconds)

And even across multiple folders:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(hdfs.`/user/flume/incoming/twitter/2016/06/*/*`(type => 'json'));
---------
| EXPR$0 |
---------
| 206793 |
---------
1 row selected (87.545 seconds)

Querying Data Without an Identifying Extension

Drill relies on the formats clause of the storage configuration to determine, from a file's extension, how to interpret it. You won't always have those extensions available, or defined. If you try to query such data, you won't get far. In this example I'm querying data on HDFS that is in JSON format but lacks the .json suffix:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171` limit 5;
Error: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs./user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171' not found

SQL Query null

Never fear – you can declare the type as part of the query syntax:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM table(hdfs.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')) limit 5;
------
| text |
------
| RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https://t.co/2NvTOShRbI |
| Oracle Java 1Z0-808 Web Exam Simulator https://t.co/tZ3gU8EMj3 |
| TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha |
| Short impression of yesterdays speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https://t.co/SVK2ovOe3U |
| Want to work at Oracle? We're #hiring in New York! Click for details: https://t.co/NMTo1WMHVw #Sales #Job #Jobs #CareerArc |
------
5 rows selected (1.267 seconds)

Storage Configuration – Oracle

Per the documentation, it's easy to query data residing in an RDBMS such as Oracle. Simply copy the JDBC driver into Apache Drill's jar folder:

cp /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar /opt/apache-drill-1.7.0/jars/3rdparty/

and then add the necessary storage configuration, which I called ora:

{
  "type": "jdbc",
  "driver": "oracle.jdbc.OracleDriver",
  "url": "jdbc:oracle:thin:moviedemo/welcome1@localhost:1521/ORCL",
  "username": null,
  "password": null,
  "enabled": true
}

If you get the error Please retry: error (unable to create/ update storage), check that the target Oracle database is up, the password is correct, and so on. You can then query the data:

0: jdbc:drill:zk=local> use ora.MOVIEDEMO;
---------------------------------------------------
| ok | summary |
---------------------------------------------------
| true | Default schema changed to [ora.MOVIEDEMO] |
---------------------------------------------------
1 row selected (0.205 seconds)

0: jdbc:drill:zk=local> show tables;
---------------------------------------------
| TABLE_SCHEMA | TABLE_NAME |
---------------------------------------------
| ora.MOVIEDEMO | ACTIVITY |
| ora.MOVIEDEMO | BDS_CUSTOMER_RFM |
| ora.MOVIEDEMO | BUSINESS_REVIEW_SUMMARY |
[...]

0: jdbc:drill:zk=local> select * from ACTIVITY limit 5;
-----------------------
| ACTIVITY_ID | NAME |
-----------------------
| 3.0 | Pause |
| 6.0 | List |
| 7.0 | Search |
| 8.0 | Login |
| 9.0 | Logout |
-----------------------
5 rows selected (1.644 seconds)

If you get the error DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query., enable verbose errors in Apache Drill to see what the problem is:

0: jdbc:drill:zk=local> ALTER SESSION SET `exec.errors.verbose` = true;
--------------------------------------
| ok | summary |
--------------------------------------
| true | exec.errors.verbose updated. |
--------------------------------------
1 row selected (0.154 seconds)

0: jdbc:drill:zk=local> select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1;
Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.

sql SELECT *
FROM MOVIEDEMO.YELP_BUSINESS
plugin ora
Fragment 0:0[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010](java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

This is a problem with the external table that Oracle is querying (ORA-29913: error in executing ODCIEXTTABLEOPEN). It's actually an Oracle external table over a Hive table – which Drill could evidently query directly – but we're just sandboxing here…

Query Execution

Just as Oracle has its Cost Based Optimizer (CBO) to help determine how a query should be executed, and executed most efficiently, Apache Drill's execution engine determines how to actually execute the query you give it. This includes how to split it up over multiple nodes ("drillbits") if available, as well as optimisations such as partition pruning in certain cases. You can read more about how query execution works here, and watch a detailed presentation on it here.

To see the explain plan for a query, use explain plan:

0: jdbc:drill:zk=local> !set maxwidth 10000
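As an aside, queries (including explain plan for …) don't have to go through sqlline at all: Drill also exposes a REST API on its web port, which takes a small JSON document per query. Below is a minimal Python sketch of building such a request; the localhost hostname is an assumption (8047 is Drill's default web port), and the HTTP call itself is commented out so the snippet runs without a live Drillbit:

```python
import json
from urllib import request

# The statement to submit; "explain plan for" returns the plan rather
# than the result rows. Note the back-quoting of `date` (a reserved word).
sql = ("explain plan for select `date`, count(*) as tip_count "
       "from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` "
       "group by `date` order by 2 desc limit 5")

# Drill's REST API accepts a POST to /query.json with "queryType" and
# "query" fields; "queryType" is "SQL" for plain SQL statements.
payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

req = request.Request(
    "http://localhost:8047/query.json",  # assumed host; default web port
    data=payload,
    headers={"Content-Type": "application/json"},
)

# Requires a running Drillbit, so left commented out:
# rows = json.load(request.urlopen(req))["rows"]
```

The response is plain JSON, so the same sketch works from any HTTP client – handy for scripting queries outside of sqlline.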
0: jdbc:drill:zk=local> explain plan for select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
------------
| text | json |
------------
| 00-00 Screen
00-01 Project(date=[$0], tip_count=[$1])
00-02 SelectionVectorRemover
00-03 Limit(fetch=[5])
00-04 SelectionVectorRemover
00-05 TopN(limit=[5])
00-06 HashAgg(group=[{0}], tip_count=[$SUM0($1)])
00-07 HashAgg(group=[{0}], tip_count=[COUNT()])
00-08 Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json]]]) | {"head" : {"version" : 1, "generator" : {"type" : "ExplainHandler", "info" : [...]

You can also use the Drill web interface to see information about how a query was executed.

Drill Explorer

The MapR Drill ODBC driver ships with a tool called Drill Explorer. This is a GUI that lets you explore the data by navigating the databases (storage plugins) and the folders/files within them, previewing the data, and even creating views on it.

Drill Client

Within the Drill client there are various settings available:

0: jdbc:drill:zk=local> !set
autocommit true
autosave false
color true
fastconnect true
force false
headerinterval 100
historyfile /home/oracle/.sqlline/history
incremental true
isolation TRANSACTION_REPEATABLE_READ
maxcolumnwidth 15
maxheight 56
maxwidth 1000000
numberformat default
outputformat table
propertiesfile /home/oracle/.sqlline/sqlline.properties
rowlimit 0
showelapsedtime true
showheader true
shownestederrs false
showwarnings true
silent false
timeout -1
trimscripts true
verbose false

To change one – for example, the width of the output displayed:

0: jdbc:drill:zk=local> !set maxwidth 10000

To connect to a remote Drill, specify the ZooKeeper node(s) that store the Drillbit connection information:

rmoff@asgard-3:apache-drill-1.7.0> bin/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181

Conclusion

Apache Drill is a powerful tool for using a familiar querying language (SQL) against different data sources. On a small scale, simply being able to slice and dice through structured files such as JSON is a massive win. On a larger scale, it will be interesting to experiment with how Apache Drill compares – perhaps against a tool such as Impala – when querying larger volumes of data across a cluster of machines.

For more on Apache Drill, see how to access Drill from within OBIEE.

Translated from: https://www.javacodegeeks.com/2016/09/guide-getting-started-apache-drill.html