This post is for those who already understand the concept of stream processing and the basic functionality of Kafka and Spark. It is also for people who are just dying to try Kafka and Spark right away. :)
Set up a simple pipeline for stream processing on your local machine.
Integrate a Spark consumer with Kafka.
Implement a word-frequency processing pipeline.
The versions of the components used in this post are:
Kafka: 0.10.1.0 with Scala 2.11
Spark Streaming: 2.10
Spark Streaming Kafka: 2.10
Step 1: Install Kafka
tar -xzf kafka_2.11-0.10.1.0.tgz
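After extracting, you can start a local ZooKeeper and a Kafka broker with the scripts shipped in the distribution, and create a topic to play with (the topic name below is just an example):

```bash
cd kafka_2.11-0.10.1.0
# Start ZooKeeper (Kafka 0.10 still requires it) and the broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
# Create an example topic for the word-frequency pipeline
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic word-count-input
```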
When the above commands are executed, you will see a message in the master’s log file indicating that the workers have connected to the master.
And that is how to use Docker and Akka to build a load-balanced distributed system.
Please make sure your master and workers are in the same VPC network when deploying onto AWS EC2 servers.
Currently, each deployment needs a different config file, which is annoying. You can use --env-file and -e to build the config inside the container from environment variables, so that role-based config files are no longer needed; only a config template is required (see the sketch below).
It is fine to add worker nodes to the master at runtime, but disconnecting a worker from the master can still go wrong. We could add a mechanism to detect dying workers and disconnect them cleanly.
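Here is a minimal sketch of the --env-file / -e idea; the image name, env file, and variable names are made up for illustration. The container receives its role-specific values as environment variables and renders them into a single config template at startup.

```bash
# master.env holds the role-specific values (made-up names):
#   CLUSTER_ROLE=master
#   SEED_HOST=10.0.0.1
docker run -d --env-file master.env -e AKKA_PORT=2551 my-akka-image
```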
You can find the source code above at:
UPDATE 2016-02-03: Changed my opinion on dividing tables into smaller ones.
BigQuery is a very powerful data analytics service.
A very heavy query on massive data may only take seconds in BigQuery.
Among all the interesting features of BigQuery, the one thing you should keep most firmly in mind
is the way BigQuery charges you.
How does BQ charge you
In general, it charges you for the size of the data you evaluate in your query.
The more data your query evaluates, the poorer BQ will make you.
One question that naturally draws attention is: what kind of operation counts as EVALUATE?
Basically, every column mentioned in your query will be considered as evaluated data.
Let’s look at an example. Say I have a table with around 1,000,000 rows of data, around 500 MB in total.
First, let’s fetch all the data from the table:
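A query of this shape would do it (the dataset and table names here are placeholders):

```sql
SELECT *
FROM [mydataset.my_table]
```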
Because this query directly fetches all the data from the table, the size of the data charged is, without doubt, 500 MB.
If we only fetch some of the columns, the cost falls dramatically.
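For instance, a query along these lines (same placeholder table, with the column names used in the discussion below):

```sql
SELECT col1, col2
FROM [mydataset.my_table]
```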
BQ charges you only for the size of the data you fetch, which here is the actual total size of col1 and col2.
If the data in col1 and col2 of this table is only 150 MB, this query will charge you 150 MB.
But the size of the evaluated data is not only the size of the fetched data.
Any column that appears in WHERE, GROUP BY, ORDER BY, PARTITION BY, or any other function
will be considered “evaluated”.
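For example, take a query of this shape (placeholder names again):

```sql
SELECT col1, col2
FROM [mydataset.my_table]
WHERE col3 IS NOT NULL
```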
The charged size is not just the total size of the data in col1 and col2,
but the total size of col1, col2, and col3, because col3 appears in the WHERE clause.
If the data in col3 is 30 MB, then the query will charge you 180 MB in total.
Note that all the data in col3 will be charged, not just the rows where it IS NOT NULL.
Actually, BigQuery does optimize the pricing for us.
If some columns in the SELECT list of a sub-query are not actually needed by the outer query,
BigQuery will not charge you for them.
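For example, consider a query of roughly this shape, where the sub-query selects everything but the outer query only touches col1, col2, and col3 (names are placeholders):

```sql
SELECT col1, col2
FROM (
  SELECT *
  FROM [mydataset.my_table]
)
WHERE col3 IS NOT NULL
```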
As you can see, even though we selected all the columns of the table in the sub-query,
BQ will not charge for all of them.
In this case, only the data in col1, col2, and col3 will be charged.
Tips on optimizing your charges
Once we know how BigQuery charges us,
it becomes clear how to save money.
Here are some tips I have drawn from my experience with BigQuery.
Tip 0: Make sure you need BigQuery
This is not actually a tip, but a very important point before you start to use BQ.
In general, BigQuery is designed to deal with massive data,
and, of course, designed to charge you for processing that massive data.
If you don’t have data that your current tools can’t handle,
you could just keep using MySQL, SPSS, R, or even Excel.
Because, well, they’re relatively cheap.
Tip 1: Do use wildcard functions
This will be your very first step to save money on BigQuery.
Let’s say we have a log_table storing all the access-log lines from a server.
If I want to run a query over only one or two days’ worth of logs, for instance
to get the users who accessed the server on certain days,
there is nothing I can do other than filtering the whole table by date.
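A sketch of that query (the table and column names are placeholders):

```sql
SELECT user_id
FROM [mydataset.log_table]
WHERE DATE(access_time) IN ('2016-01-01', '2016-01-02')
GROUP BY user_id
```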
Well, never do this.
If you have collected years of logs in this table,
this one single query will make you cry.
The better solution, the correct one, is to import each day’s log
into a separate table with a date suffix in the first place.
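For example (dataset and table names are placeholders), the daily tables might be named access_log_20160101, access_log_20160102, and so on. You can then query only the days you need with a wildcard function such as TABLE_DATE_RANGE:

```sql
SELECT user_id
FROM TABLE_DATE_RANGE([mydataset.access_log_],
                      TIMESTAMP('2016-01-01'),
                      TIMESTAMP('2016-01-02'))
GROUP BY user_id
```

This way only the tables for those two days are scanned and charged.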
Suppose you go one step further and also split the log by shop, so that each table holds less than 1 MB of data.
If you have to use the date-only access_log tables (say, around 10 MB per day) to generate another kind of tables separated by shop,
you will be charged 10MB * <num-of-shops> * <number-of-days>,
while if the access_log is already separated by shop, you will be charged at most 1MB * <num-of-shops> * <number-of-days>.
So, is it a bad idea to separate tables by something other than date?
No; it depends.
As you can see, if the access_log for each shop on each day is around 10 MB or even bigger,
this design is actually perfect.
So whether to separate further depends on the size of your data.
The point is to know the granularity of your data before you make the decision.
Actually, dividing tables not only by date but also by some other id can keep BigQuery from functioning well.
From my own experience: I have had 20,000+ tables in one dataset, separated by both date and an id.
The tables look like [table_name]_[extra_id]_[date_YYmmdd].
And the following issues really do happen:
BigQuery API calls will occasionally fail with an InternalException. This happens, for example, to the API call used to create an empty table with a given schema.
Each query you fire records the following three timestamps. Usually the interval between the first and the second is less than 1 second, but when you have a lot of tables, as in my case, the gap between creationTime and startTime can stretch to 20 seconds.
creationTime: the time your request arrives at the BQ server.
startTime: the time the BQ server starts executing the query.
endTime: the time the BQ server finishes your query or finds it invalid.
BigQuery starts failing to find your tables. Queries will fail with errors like FROM clause with table wildcards matches no table or **Not found: Table**, even though you know perfectly well that the tables exist. And when you run the failed query again without fixing anything, it dramatically succeeds.
So, I STRONGLY recommend not using too many tables in the same dataset.
Tip 4: Think before you use a mid-table
A mid-table is a table you use to store a temporary result.
Do use it when:
More than one query is going to use this mid-table.
Let’s say that by running query A, you get the result you want.
But there is another query B that also uses the sub-query C contained in query A.
If you don’t use a mid-table, you will be charged for:
sub-query C, plus the remaining part of query A
sub-query C again, plus the remaining part of query B
But if you use a mid-table to store the result of sub-query C,
you will be charged for:
sub-query C, evaluated only once to fill the mid-table
the remaining part of query A, reading from the mid-table
the remaining part of query B, reading from the mid-table
By doing this, you save the cost of evaluating sub-query C a second time, at the price of scanning the (usually much smaller) mid-table instead.
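A rough sketch of the idea, with made-up table and column names: run sub-query C once, save its result to a mid-table (for example by setting a destination table for the job), and let both A and B read from that mid-table.

```sql
-- Sub-query C, run once; write its result to a destination table,
-- e.g. mydataset.mid_c (a made-up name):
SELECT user_id, COUNT(*) AS access_count
FROM [mydataset.access_log_20160101]
GROUP BY user_id;

-- Query A and query B now scan only the small mid-table
-- instead of re-evaluating C against the raw log:
SELECT user_id FROM [mydataset.mid_c] WHERE access_count >= 10;
SELECT AVG(access_count) FROM [mydataset.mid_c];
```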
Do NOT use it when:
You only fear that your query is getting too long, while no other query uses the mid-table you would generate.
Not everyone will agree with this.
Some people may say “This is too nasty”, “God, this query is too fat”, or even “WTF is this?”.
Well, generating a mid-table in this situation will surely charge you more.
I will just choose to save more of my money.
And if you format your query well, no matter how long it grows,
it will still look nice.
So, these are some tips I have gathered recently; I hope they help you,
and save you money.
At this point, the combination of start_of_session and is_next_access_sos, and the meaning behind it, must be one of the following:
| start_of_session | is_next_access_sos | this_access_must_be |
|---|---|---|
| true | false | the first access in a session with number of accesses >= 2 |
| true | true | the only access in its session |
| true | null | the only access in its session, and the last access in the partition |
| false | true | the last access in a session with number of accesses >= 2 |
| false | false | neither the first nor the last access, in a session with number of accesses >= 3 |
| false | null | the last access in a session with number of accesses >= 2, and the last access in the partition |
Knowing this, we can easily tell whether an access is the last access in the partition.
After this step, we get all the accesses that are either the start or the end of a session in the result.
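As a rough illustration, assuming the accesses are partitioned per user and ordered by time (user_id, access_time, and the table name below are assumptions, not the original schema): is_next_access_sos is simply start_of_session of the following row in the same partition, obtained with LEAD, and the rows kept are exactly the session starts and ends from the table above.

```sql
SELECT user_id, access_time, start_of_session, is_next_access_sos
FROM (
  SELECT
    user_id,
    access_time,
    start_of_session,
    LEAD(start_of_session) OVER (PARTITION BY user_id ORDER BY access_time)
      AS is_next_access_sos
  FROM [mydataset.access_with_sos]
)
WHERE start_of_session            -- first access of a session
   OR is_next_access_sos          -- last access of a session
   OR is_next_access_sos IS NULL  -- last access in the partition
```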