Product Tech

A Quick Tutorial for Apache Drill and Quobyte

By  Björn Kolbeck  on  

In this post I’ll show you how to combine Quobyte’s software storage and Apache Drill – a tool to query non-relational data stores (such as files) in parallel, using a SQL-like syntax. Both form a natural duo that allows you to speed up your number crunching significantly. Apache Drill enhances your productivity by enabling SQL-on-everything whereas Quobyte delivers the throughput required for high-performance data analysis.

As an ex-Googler I was particularly excited to finally be able to work with something akin to Dremel again, which I’d often used at work and had missed quite dearly. If you’re a data scientist or researcher, you’ll most likely know about the tool already and I’m sure, you’ll share my view that Apache Drill is a great addition to the data scientist’s toolbox.

Data Access with a High Throughput for Faster Drilling

Apache Drill has two components: The client module – either in the form of a shell or as a web frontend – and the drillbits. Drillbits run on all machines that contribute compute power and execute your tasks in parallel, similar to the MapReduce program. They can process a range of file types including csv, JSON or Hadoop Sequence Files.

Illustration Quobyte and Apache Drill

Now, here’s where Quobyte comes in: The drillbits need access to the same data and they need it fast – and how to better get there than by using a shared file system with high-performance throughput? That’s what Quobyte does; hence it’s the ideal storage backend for the job. It not only provides parallel throughput without bottlenecks, but it’s also designed to maximize performance – both on flash drives and HDDs. That means, even your largest Drill jobs with data stored on hard drives will get the massive throughput they require. Did you notice the implicit and beneficial side effect here? You won’t need a hot or cold tier any longer since all your data is always accessible at high speed – so you can run, e.g., multi-year data queries.

Since Quobyte allows you to access the same files and volumes from any interface, you can easily ingest the data and inspect or visualize it on your Windows (or any other) workstation.

Setting Up the Data Pipeline

To get started with our example, you need at least four nodes with hard drives. Flash drives will also work, but why waste flash on a sequential workload? We assume the four servers are called server1, server2, server3 and server4 and there should be no firewalls configured between them.

Install Quobyte

If you don’t have Quobyte running on the nodes yet, go to https://www.quobyte.com/signup to get your free 45-day trial version.

  • Start the installation with ./install_quobyte.sh
  • For the purposes of this post, we’ll assume that you’re running Quobyte services on the same machines as the drillbits – i.e. you should specify each machine as server and client during Quobyte’s install.
    If you prefer to run the drillbits on separate machines, specify your drillbit machines as clients.
  • Once Quobyte is running, create a new volume called “drilltest”. You can use any volume config you see fit, but you should opt for erasure coding as it’s best suited for our case at hand.

Screenshot Quobyte Web UI

Install ZooKeeper

Next, we’ll install Apache ZooKeeper on server1, which is required for a distributed Apache Drill set-up:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
  • Start ZooKeeper: bin/zkServer.sh start

For a production set-up you’d also want zookeeper to be replicated, but we skip this step for simplicity.

Install Apache Drill

Now we can install Apache Drill on the drillbits, i.e. all four servers:

  • Download Drill from this page: (Get the mirror link and wget on your machines) http://www.apache.org/dyn/closer.cgi/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz
  • Untar: tar xzf apache-drill-1.12.0.tar.gz
  • Change to the apache-drill-1.12.0 directory
  • Edit conf/drill-override.conf and set the target for zk.connect to server1:2181. You can leave the cluster ID as is, or change it to your own choice. Just make sure you have the same cluster ID on all your servers!
  • Start the drillbit: bin/drillbit.sh start
  • Check the log for a successful start-up. The command from the line above will print the location of the log for errors.

Once all four drillbits are up and running, we can start working with Drill:

  • Start the drill console with bin/drill-conf
  • Run SELECT * FROM sys.drillbits; to check that all drillbits are actually there.
[bjoern@server2 apache-drill-1.12.0]$ bin/drill-conf
/home/bjoern/apache-drill-1.12.0/bin/drill-config.sh: line 317: lsb_release: command not found
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
apache drill 1.12.0 
"say hello to my little drill"
0: jdbc:drill:> SELECT * FROM sys.drillbits;
+------------------------------+------------+---------------+------------+----------+----------+---------+
|           hostname           | user_port  | control_port  | data_port  | current  | version  |  state  |
+------------------------------+------------+---------------+------------+----------+----------+---------+
| server1.c.eda-eval.internal  | 31010      | 31011         | 31012      | true     | 1.12.0   | ONLINE  |
| server2.c.eda-eval.internal  | 31010      | 31011         | 31012      | false    | 1.12.0   | ONLINE  |
+------------------------------+------------+---------------+------------+----------+----------+---------+
2 rows selected (3.046 seconds)
0: jdbc:drill:> 
  • Next open the web UI in your browser: server1:8047 and add Quobyte as a data source:

Screenshot Drill Web UI

In the UI, add a new data source called “quobyte” with the following config:

{
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "config": null,
 "workspaces": {
   "volumes": {
     "location": "/quobyte",
     "writable": false,
     "defaultInputFormat": "json",
     "allowAccessOutsideWorkspace": false
   }
 },
 "formats": {
   "csv": {
     "type": "text",
     "extensions": [
       "csv"
     ],
     "extractHeader": true,
     "delimiter": ","
   },
   "json": {
     "type": "text",
     "extensions": [
        "json"
     ]
   }
 }
}

Screenshot Drill Config

Analysing Data with Quobyte and Apache Drill

Now that everything’s set up and working, let’s load a simple dataset via the Quobyte client with a simple wget and use data.gov. Here’s one which contains demographic data from NYC sorted by ZIP code:

cd /quobyte/drilltest/
[user@server1 drilltest]$ wget -O demographics.csv 'https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv?accessType=DOWNLOAD'

Once the download finishes, switch to the Apache Drill web console and click on “Query”. Enter the following SQL query to find all the ZIP codes with an equal distribution of women and men:

SELECT * FROM quobyte.volumes.`/drilltest/demographics.csv` WHERE COUNT_FEMALE = COUNT_MALE

And if we count the result, we find that there are 137 districts where that’s the case.

But let’s move on to a more interesting (and larger) data set: the flight delay statistics in the US for 2017 (which you can download here: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236; make sure to select the fields AirlineID, Carrier, and DepDelay).

Unzip the file in /quobyte/drilltest and rename it to ontime.csv.

After some cleaning up (i.e. delete some NULL values in our delay column), we can now use Drill to find the most delayed flights in 2017:
SELECT * FROM quobyte.volumes.`/drilltest/ontime.csv` ORDER BY DEP_DELAY DESC LIMIT 5;

A whopping 45 hours delay for the one topping this list, followed by a close 32 to 37 hours delay for the rest of our top 5. I seriously hope you weren’t booked on one of those flights…

And that’s it for our quick tutorial. You can now go ahead and try out other datasets or play around some more with the flight data. You’ve learned how to load new datasets through your Quobyte client for MacOS, Linux, or Windows and how to quickly analyze them with Drill.

Happy drilling!

Photo of Björn Kolbeck

Written by

Björn is Quobyte’s co-founder and CEO.