Hive provides a convenient way to query your data on HDFS.
A common case is that the data already exists on HDFS, and you then want to use Hive to analyze it.
In that case, creating an external table is the approach that makes sense.
In this post, we are going to discuss a more involved usage where we need to include more than one partition field in this external table, and where the original data on HDFS is in JSON.
For the difference between managed table and external table, please refer to this SO post.
Here is your data
In this post, we assume your data is saved on HDFS as /user/coolguy/awesome_data/year=2017/month=11/day=01/*.json.snappy.
The data is well divided into daily chunks. So, we definitely want to keep year, month, day as the partitions in our external hive table.
Make Hive able to read JSON
Since every line in our data is a JSON object, we need to tell Hive how to interpret it as a set of fields.
To achieve this, we are going to add an external jar.
There are two jars that I know of that can do the job:
hive-hcatalog-core (Hive's HCatalog JsonSerDe)
json-serde (the OpenX JsonSerDe)
To add the jar of your choice to Hive, simply run ADD JAR in the Hive console:
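For example, assuming you downloaded the HCatalog jar to /tmp (the path is a placeholder):

ADD JAR /tmp/hive-hcatalog-core-X.Y.Z.jar;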
Note: The path here is the path to your jar on the local machine, but you can also point to a jar on HDFS by using the hdfs:// prefix.
Create the external table
By now, all the preparation is done. The rest of the work is pretty straightforward:
Tell Hive where to look for the data.
Tell Hive which fields are the partition fields.
Tell Hive which library to use for JSON parsing.
So, the HQL to create the external table is something like:
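The sketch below assumes a few example fields (user_id, action, ts); replace them with your own schema:

    CREATE EXTERNAL TABLE awesome_data (
      user_id STRING,
      action STRING,
      ts BIGINT
    )
    PARTITIONED BY (year STRING, month STRING, day STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/user/coolguy/awesome_data/';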
This HQL uses hive-hcatalog-core-X.Y.Z.2.4.2.0-258.jar to parse JSON. For the usage of json-serde-X.Y.Z-jar-with-dependencies.jar, change ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' to ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'.
There are two things you want to be careful about:
The fields for partition shouldn’t be in the <field-list>.
The part of /year=2017/month=11/day=01/ in the path shouldn’t be in the LOCATION.
And here you go, you get yourself an external table based on the existing data on HDFS.
However…
Soon enough, though, you will find that this external table doesn't seem to contain any data.
That is because we need to manually add partitions into the table.
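For the data that is already there, that means something like (table name as in the sketch above):

    ALTER TABLE awesome_data ADD PARTITION (year='2017', month='11', day='01')
    LOCATION '/user/coolguy/awesome_data/year=2017/month=11/day=01/';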
When you finish the ingestion of /user/coolguy/awesome_data/year=2017/month=11/day=02/, you should also run:
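Again with the assumed table name:

    ALTER TABLE awesome_data ADD PARTITION (year='2017', month='11', day='02')
    LOCATION '/user/coolguy/awesome_data/year=2017/month=11/day=02/';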
In this post, we will use a local Flink setup with savepoints configured, consuming from a local Kafka instance.
We will also have a very simple Kafka producer that feeds sequential numbers to Kafka.
To check whether savepointing actually works, we will deliberately stop the Flink program, restore it from the last savepoint, and then check that the consumed events are consistent with what the producer sent.
Setup local Kafka
Setting up a local Kafka is pretty easy. Just follow the official guide, and you will have a ZooKeeper and a Kafka instance on your local machine.
Start zookeeper:
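Assuming you run these from the unpacked Kafka directory:

bin/zookeeper-server-start.sh config/zookeeper.properties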
Start Kafka:
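In another terminal, from the same directory:

bin/kafka-server-start.sh config/server.properties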
Create our testing topic:
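The topic name flink-test is just the one I will assume for the rest of this post:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic flink-test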
Create a script to feed events
As mentioned above, the producer is very simple.
I'm going to use Python to create this producer.
First, the Kafka dependency:
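I will assume the kafka-python client package here:

pip install kafka-python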
Then create a python script producer.py like this:
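A minimal sketch, assuming the flink-test topic created above and a broker on localhost:9092; the payload {"idx": n} matches the numbers referenced later in this post:

    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    idx = 0
    while True:
        # feed sequential numbers so gaps or duplicates are easy to spot later
        producer.send('flink-test', json.dumps({'idx': idx}).encode('utf-8'))
        idx += 1
        time.sleep(0.1)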
Then run this script to start feeding events:
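python producer.py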
I also recommend running a simple consumer to check whether the current setup is working. Logstash does the job nicely with a config file like:
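A sketch, assuming Logstash 5.x with its Kafka input plugin and the topic above:

    input {
      kafka {
        bootstrap_servers => "localhost:9092"
        topics => ["flink-test"]
      }
    }
    output {
      stdout { codec => rubydebug }
    }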
But any other kind of consumer will easily do the job.
Setup Flink
First, you will need to download the Flink release you want.
After downloading the package, unpack it. You will then have everything you need to run Flink on your machine.
Assume that Java and mvn are already installed.
Setup local Flink cluster
This will be the tricky part.
First we need to change the config file: ./conf/flink-conf.yaml.
In this file, you can specify basically all the configuration for the Flink cluster, such as the job manager heap size and the task manager heap size.
But in this post, we are going to focus on save point configurations.
Add the following line to this file:
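Assuming a reasonably recent Flink (1.2+) and a local directory for this experiment:

    # default target directory for savepoints (local path chosen for this experiment)
    state.savepoints.dir: file:///tmp/flink-savepoints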
With this field specified, Flink will use this folder to store savepoint files.
In the real world, I believe people usually use HDFS (hdfs:///...) or S3 to store their savepoints.
In this experiment, we will use a local setup for flink cluster by running ./bin/start-local.sh.
This script will read the configuration from ./conf/flink-conf.yaml, which we just modified.
Then the script will start a job manager along with a task manager on localhost.
You could also change the number of slots in that task manager by modifying ./conf/flink-conf.yaml,
but that's not something we need for the current topic. For now we can just run:
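./bin/start-local.sh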
Prepare the job
Think of the Flink cluster as just a distributed worker system with streaming abilities.
It is just a bunch of starving workers (and their supervisors) waiting for job assignments.
Although we have set up the cluster, we still need to give it a job.
Here is a simple Flink job we will use to check out the save point functionality:
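Below is a minimal sketch in Scala of what such a job could look like, assuming Flink 1.2/1.3 with the Kafka 0.10 and filesystem (bucketing sink) connectors on the classpath; the class name, topic, output path, and intervals are all placeholders:

    import java.util.Properties

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema

    object SavepointTestJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // checkpoint every 10s so pending files keep turning into finished files
        env.enableCheckpointing(10000)

        val props = new Properties()
        props.setProperty("bootstrap.servers", "localhost:9092")
        props.setProperty("group.id", "flink-savepoint-test")

        val stream = env.addSource(
          new FlinkKafkaConsumer010[String]("flink-test", new SimpleStringSchema(), props))

        // write the raw events to local files, one bucket (folder) per minute
        val sink = new BucketingSink[String]("/tmp/flink-out")
        sink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HHmm"))

        stream.addSink(sink)
        env.execute("kafka-savepoint-test")
      }
    }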
It will consume the events from our Kafka topic and write them to local files, divided by minute.
Start experiment
To submit our job, first build the above project by:
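Assuming a standard Maven project layout:

mvn clean package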
There will be a fat jar under your target folder: <project-name>.<version>.jar
To submit this jar as our job, run:
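From the Flink directory:

./bin/flink run target/<project-name>.<version>.jar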
Once it starts running, you will find files being generated under the basePath.
Restart the job without save point
After the job runs for a while, cancel the job in the flink UI.
Check the latest finished file before the cancellation, and find the last line of this file. In my experiment, it's {"idx": 2056}.
Then start the job again:
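That is, submit the same jar again, with no savepoint argument:

./bin/flink run target/<project-name>.<version>.jar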
After a few minutes, check the first finished file after the restart, and find the first line of this file. In my experiment, it's {"idx": 2215}.
This means events went missing when we restarted the job without a savepoint.
A finished file is a file that has been confirmed by one of Flink's checkpoints.
It is a file whose name has neither the in-progress nor the pending marker.
Initially, a file being written is in the in-progress state.
When the file stops being written to for a while (configurable), it becomes pending.
A checkpoint then turns pending files into finished files.
Only finished files should be considered the consistent output of Flink; otherwise you will get duplication.
For more info about checkpointing, please check the official documentation.
Restart with save point
Let’s try save pointing.
This time, we will create a savepoint before canceling the job.
Flink allows you to create a savepoint by executing:
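./bin/flink savepoint <job-id>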
The <job-id> can be found in the header of the job page in the Flink web UI.
After you run this command, Flink will tell you the path to your savepoint file. Do record this path.
Then we cancel the job, check the latest finished file before the cancellation, and find the last line of this file. In my experiment, it's {"idx": 4993}.
Then we restart the job.
Because we want to restart from a save point, we need to specify the save point when we start the job:
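Using the path recorded from the savepoint command:

./bin/flink run -s <savepoint-path> target/<project-name>.<version>.jar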
After a few minutes, check the first finished file after the restart, and find the first line of this file. In my experiment, it's {"idx": 4994}, which is consistent with the number before the cancellation.
General thoughts
Flink's savepointing is much easier than I expected.
The only thing to constantly keep in mind is that we need to record the savepoint path carefully, for instance when we use crontab to create savepoints.
Other than that, Flink seems to handle all the data consistency.
As further experiment and research, it would be very useful to try Flink's savepoints on a cluster setup with multiple task managers.
React with Redux has become a very common stack these days.
People with very little front-end background, like me, can write something that turns out not bad.
However, testing a React component connected to the Redux can sometimes be painful.
In this post, I want to share some of the ideas about how to test against a Redux container.
First of all, we need a container to be tested against.
In the following gist, we have a text box that:
saves the value of input into component’s local state
submits the input value via Redux action dispatch to the Redux store.
shows the input value in the Redux store.
Note this component is only used for this example. Separating the text box from the presentational part is more practical in a real project.
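Since the original gist is not reproduced here, below is a rough reconstruction of such a container; the action type, prop names, and class names (ShowBox and its connected default export) are assumptions made to match the rest of this post:

    import React from 'react';
    import { connect } from 'react-redux';

    // hypothetical action creator used only for this example
    const submitText = (text) => ({ type: 'SUBMIT_TEXT', text });

    export class ShowBox extends React.Component {
      constructor(props) {
        super(props);
        this.state = { input: '' };
        this.handleInput = this.handleInput.bind(this);
        this.handleSubmit = this.handleSubmit.bind(this);
      }

      // keep the current value of the text box in local state
      handleInput(event) {
        this.setState({ input: event.target.value });
      }

      // send the input value to the Redux store via a dispatched action
      handleSubmit() {
        this.props.submitText(this.state.input);
      }

      render() {
        return (
          <div>
            <input type="text" value={this.state.input} onChange={this.handleInput} />
            <button onClick={this.handleSubmit}>Submit</button>
            <p className="submitted-text">{this.props.submittedText}</p>
          </div>
        );
      }
    }

    const mapStateToProps = (state) => ({ submittedText: state.submittedText });
    const mapDispatchToProps = { submitText };

    export default connect(mapStateToProps, mapDispatchToProps)(ShowBox);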
I highly recommend using Enzyme with all your React components’ tests. This utility makes it much easier to traverse and assert your React/Redux components.
Besides, it now works pretty well with Mocha and Karma and other stuff you might use in your test.
Redux store wrapper for the tests
One of the painful parts of testing a Redux container is that we can't just render the container and then evaluate the DOM anymore. That's because the container asks for the Redux store in mapStateToProps() when you connect your component to Redux, and without a store, the test fails.
So, before we start writing test code, we need a "Redux wrapper" for all of our testing components to provide the Redux store, pretty much like what we usually do for the root component of the application.
Luckily, Enzyme provides a shallow-render function that can take a context as second parameter, where we can put a Redux store for our tests.
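A small helper along these lines (a sketch that relies on react-redux 5.x reading the store from context; assume it is saved as shallowWithStore.js next to the tests):

    import { shallow } from 'enzyme';

    export const shallowWithStore = (component, store) => {
      const context = { store };
      return shallow(component, { context });
    };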
Make sure it’s working
Let’s first try to shallow-render our component in the tests with the shallowWithStore we created. We are going to mock our Redux store by using createMockStore(state) from redux-test-utils.
By using shallowWithStore with the mocked store, we can finally shallow-render our container.
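A sketch of such a smoke test, assuming Mocha and Chai, and that the container lives at ../src/ShowBox:

    import React from 'react';
    import { expect } from 'chai';
    import { createMockStore } from 'redux-test-utils';

    import ConnectedShowBox from '../src/ShowBox';
    import { shallowWithStore } from './shallowWithStore';

    describe('ConnectedShowBox', () => {
      it('renders with a mocked store', () => {
        const testState = { submittedText: 'hello' };
        const component = shallowWithStore(<ConnectedShowBox />, createMockStore(testState));
        expect(component.exists()).to.equal(true);
      });
    });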
Test non-Redux feature with Enzyme
There are two non-Redux features we might want to test:
DOM structure.
handleInput function in the component.
Achieving these two goals is extremely simple with the Enzyme.
To test the DOM structure, you can use find() to traverse the rendered result, then use prop() to get the desired attribute then make the assertion.
To test event handler, you can first get the element with find(), then use simulate() to trigger the event. After that, just make assertion as usual.
Please note that we call dive() before calling find().
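For example, still using the mocked store and the assumed names from the sketch above:

    it('renders the text from the store', () => {
      const store = createMockStore({ submittedText: 'hello' });
      const component = shallowWithStore(<ConnectedShowBox />, store);
      // dive() renders one level deeper, i.e. the inside of ShowBox
      expect(component.dive().find('.submitted-text').prop('children')).to.equal('hello');
    });

    it('keeps the typed value in the local state', () => {
      const store = createMockStore({ submittedText: '' });
      const box = shallowWithStore(<ConnectedShowBox />, store).dive();
      box.find('input').simulate('change', { target: { value: 'abc' } });
      expect(box.state('input')).to.equal('abc');
    });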
I used Chai for assertion. It might look different from what you would write depending on your setup.
One thing to be careful about is that if we shallow-render a connected container, the rendering will not reach the DOM inside the wrapped component.
This means the whole DOM inside ShowBox will not be rendered, which will cause the following assertion to fail:
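Using the hypothetical input element from the sketch above:

    // fails: without dive(), nothing inside ShowBox's render() exists yet
    expect(component.find('input').length).to.equal(1);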
The reason is that shallow-render only renders the direct children of a component.
When we shallow-render connected container <ConnectedShowBox />, the deepest level it gets rendered is <ShowBox />. All the stuff inside the ShowBox’s render() will not be rendered at all.
To make the result of shallow-render go one layer deeper (to render the stuff inside <ShowBox />), we need to call dive() first.
You can also test mapStateToProps here by simply changing the mocked state (testState).
Test Redux dispatch mapDispatchToProps
Testing mapDispatchToProps with Enzyme is really easy.
Just like what we did to test handleInput(), we first use find() to get the element we need to trigger, then use simulate() to trigger the event.
Finally use isActionDispatched() to check whether the expected action is dispatched.
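A sketch, again assuming the action shape from the reconstructed container above:

    it('dispatches the submit action with the typed text', () => {
      const store = createMockStore({ submittedText: '' });
      const box = shallowWithStore(<ConnectedShowBox />, store).dive();
      box.find('input').simulate('change', { target: { value: 'abc' } });
      box.find('button').simulate('click');
      expect(store.isActionDispatched({ type: 'SUBMIT_TEXT', text: 'abc' })).to.equal(true);
    });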
And that is it! I think this is enough for most of the use cases in testing.
To use plugins supported by jQuery, you will have to add jQuery to your component.
Step 1:
First you need to add the jQuery into your project.
To achieve that you could either add the jQuery link to your html wrapper, or simply run
npm i -S jquery
then import it in your component.
Step 2:
After you import jQuery into your project, you are already able to use it in your components.
But if you are using server-side rendering, this may not work for you, because jQuery is supposed to run on a rendered DOM document. So in SSR, you can only expect jQuery to work on the client side.
With that being said, the solution for using jQuery with SSR is simple: put your jQuery code in componentDidMount(), which only runs on the client once the DOM has been rendered.
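A sketch of the idea; the component name and the jQuery call are made up for illustration:

    import React from 'react';
    import $ from 'jquery';

    class FancyBox extends React.Component {
      componentDidMount() {
        // runs only in the browser, after the real DOM exists,
        // so it is safe even when the markup was rendered on the server
        $(this.el).fadeIn(); // any jQuery or plugin call goes here
      }

      render() {
        return <div ref={(el) => { this.el = el; }} style={{ display: 'none' }}>Hello</div>;
      }
    }

    export default FancyBox;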
This is a beginner tutorial for those who already understand how Kafka works and the basic functionality of Kerberos.
Purpose
Setup a simple pipeline for stream processing within the VMs.
Integrate Kafka with Kerberos’ SASL authentication.
Console pipeline to check if our setup works.
The versions of components in this post will be:
Ubuntu: 16.04.1
Kafka: 0.10.1.0 with scala 2.11
Kerberos: 5
Prepare the Servers
We are going to have 3 Ubuntu servers in the VirtualBox:
server-kerberos A server for Kerberos.
server-kafka A server for both Zookeeper and Kafka Broker.
server-kafka-client A server for Kafka clients(producer and consumer).
You can also make server-kafka-client into two different servers, but the setups for these two servers will be exactly the same.
To make things simple and clean, this post will only use one server to host both producer and consumer.
For the same reason, server-kafka can also be split into two servers.
Installation of Kerberos and Kafka
Use VirtualBox to create all these 3 servers with Ubuntu.
Go to your VirtualBox manager, and make sure all your boxes’ network adapters are set to NAT.
For server-kerberos:
To install kerberos, enter:
apt-get install krb5-admin-server krb5-kdc
During the installation, you will be asked for several settings. Enter the settings like below:
Default Kerberos version 5 realm? [VISUALSKYRIM]
Kerberos servers for your realm? [kerberos.com]
Administrative server for your realm? [kerberos.com]
For server-kafka, server-kafka-client:
Install krb5-user for SASL authentication:
sudo apt-get install krb5-user
During this installation, you will be asked the same questions. Just answer them with the same answer:
Default Kerberos version 5 realm? [VISUALSKYRIM]
Kerberos servers for your realm? [kerberos.com]
Administrative server for your realm? [kerberos.com]
These questions will generate a Kerberos config file at /etc/krb5.conf.
Install kafka:
wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/kafka/0.10.1.0/kafka_2.11-0.10.1.0.tgz
tar -xzf kafka_2.11-0.10.1.0.tgz
cd kafka_2.11-0.10.1.0
Later in this post, you will need to transfer authentication files (keytabs) between servers.
For that purpose, this post will use scp, and openssh-server will be installed.
If you are going to use other methods to transfer files from server-kerberos, feel free to skip this installation.
apt-get install openssh-server
Setting up servers
Before starting to set up servers, we need to change our VMs’ network adapters to Host-Only and reboot to get an individual IP address for each VM.
After that, go to each server to get their IP address by
ifconfig
Then put those IP addresses and the hostnames into /etc/hosts on each server. Something like this:
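The IP addresses below are just examples; use the ones ifconfig reported on your VMs, and whatever hostnames you prefer:

    192.168.56.101  kerberos.com
    192.168.56.102  kafka.com
    192.168.56.103  kafka-client.com

Next, on server-kerberos, create the principals and export their keytabs with kadmin. The principal names below are assumptions that the rest of this post will stick to:

    sudo kadmin.local -q "add_principal -randkey zookeeper/kafka.com@VISUALSKYRIM"
    sudo kadmin.local -q "add_principal -randkey kafka/kafka.com@VISUALSKYRIM"
    sudo kadmin.local -q "add_principal -randkey kafka-client@VISUALSKYRIM"
    sudo kadmin.local -q "xst -k /tmp/zookeeper.keytab zookeeper/kafka.com@VISUALSKYRIM"
    sudo kadmin.local -q "xst -k /tmp/kafka.keytab kafka/kafka.com@VISUALSKYRIM"
    sudo kadmin.local -q "xst -k /tmp/kafka-client.keytab kafka-client@VISUALSKYRIM"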
Move /tmp/zookeeper.keytab and /tmp/kafka.keytab to your server-kafka, and move /tmp/kafka-client.keytab to your server-kafka-client.
Kafka Server
Just like in the real world, every individual (program) in the distributed system must tell the authority (Kerberos) two things to identify itself:
The first thing is the accepted way for this role to be identified.
It's like how in America people usually use a driver's license, in China people use an ID card, and in Japan people use the so-called MyNumber card.
The second thing is the file or document that identifies you according to that accepted identification method.
In our SASL context, the file that identifies the role is the keytab file we generated via kadmin just now.
I suggest you put zookeeper.keytab and kafka.keytab under /etc/kafka/ on your server-kafka.
We need a way to tell our programs where to find these files and how to hand them over to the authority (Kerberos). That is what the JAAS file is for.
We create the JAAS files for Zookeeper and Kafka and put them at /etc/kafka/zookeeper_jaas.conf and /etc/kafka/kafka_jaas.conf.
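A sketch of what they can look like, with the keytab locations and principals assumed above:

    // /etc/kafka/zookeeper_jaas.conf
    Server {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/kafka/zookeeper.keytab"
        principal="zookeeper/kafka.com@VISUALSKYRIM";
    };

    // /etc/kafka/kafka_jaas.conf
    KafkaServer {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/kafka/kafka.keytab"
        principal="kafka/kafka.com@VISUALSKYRIM";
    };

    // the broker itself also talks to Zookeeper as a client
    Client {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/kafka/kafka.keytab"
        principal="kafka/kafka.com@VISUALSKYRIM";
    };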
To keep this post easy and simple, I chose to modify bin/kafka-run-class.sh, bin/kafka-server-start.sh, and bin/zookeeper-server-start.sh to insert the JVM options below into the launch command.
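The options in question look roughly like this (paths as assumed above; the broker gets kafka_jaas.conf, Zookeeper gets zookeeper_jaas.conf):

    -Djava.security.krb5.conf=/etc/krb5.conf
    -Djava.security.auth.login.config=/etc/kafka/kafka_jaas.conf

An alternative to editing the scripts is exporting these through KAFKA_OPTS before starting each process.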
To enable SASL authentication in Zookeeper and Kafka broker, simply uncomment and edit the config files config/zookeeper.properties and config/server.properties.
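Roughly, the relevant settings are (a sketch for Kafka 0.10; hostnames follow the assumptions above):

    # config/zookeeper.properties
    authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
    requireClientAuthScheme=sasl
    jaasLoginRenew=3600000

    # config/server.properties
    listeners=SASL_PLAINTEXT://kafka.com:9092
    advertised.listeners=SASL_PLAINTEXT://kafka.com:9092
    security.inter.broker.protocol=SASL_PLAINTEXT
    sasl.mechanism.inter.broker.protocol=GSSAPI
    sasl.enabled.mechanisms=GSSAPI
    sasl.kerberos.service.name=kafka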
Setting up your server-kafka-client is quite similar to what you've just done for server-kafka.
For the JAAS file, because we are going to use the same principal and keytab for both the producer and the consumer in this case, we only need to create one single JAAS file, /etc/kafka/kafka_client_jaas.conf:
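Something like this, again with the assumed principal and keytab path:

    KafkaClient {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/kafka/kafka-client.keytab"
        principal="kafka-client@VISUALSKYRIM";
    };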