Recently I took on a somewhat complicated task: log analysis across multiple logs. My first thought was to use Spark, but then I remembered an article I had read a while ago about how Cloudflare had migrated from a PostgreSQL-based store (a.k.a. RollupDB) to ClickHouse. As a result, I wrote this guide on how to install ClickHouse and try it out for yourself. For these instructions I used Debian 9 on Civo, and you can follow along by getting yourself an account. Civo offers $250 of free credit to new users for a month: Sign up here.

What is ClickHouse?

According to the official site:

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. ClickHouse scales well both vertically and horizontally. ClickHouse is easily adaptable to perform either on cluster with hundreds of nodes, or on a single server or even on a tiny virtual machine. Currently there are installations with more than two trillion rows per single node, as well as installations with 100Tb of storage per single node.
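To make the column-oriented, SQL-driven model concrete, here is a minimal sketch of what a table definition looks like in ClickHouse, using its common MergeTree storage engine. The table and column names are purely illustrative:

CREATE TABLE logs
(
    timestamp DateTime,
    level String,
    message String
)
ENGINE = MergeTree
ORDER BY timestamp

Each column is stored and compressed separately on disk, which is what makes scans over a handful of columns across millions of rows so fast.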

Install ClickHouse

For this part we will create a Large instance (4 CPU cores, 8 GB RAM, 100 GB SSD) in Civo, running Debian 9. You can do this either on the web dashboard or using the Civo command-line tool.

To match the demonstration here, you can call your instance test-clickhouse. The RAM and CPU requirements depend on the amount of data you want to handle, or how many logs you want to process. If you have a huge data set, you may need more processing power.

For this example we will use USA civilian flight data from 1987 to 2018. It's not exactly "Big Data" at 183 million rows and 15 GB of compressed data, but this demo set lets us get to work quickly and demonstrate the capabilities of ClickHouse.

Once you have access to your newly created server, the first step is to install ClickHouse:

Install packages for Ubuntu/Debian

# Install prerequisites and add the repository signing key
sudo apt-get install dirmngr curl
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4

# Add the ClickHouse repository and update the package index
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update

# Install the server and the command-line client
sudo apt-get install -y clickhouse-server clickhouse-client

# Start the server and open an interactive client session
sudo service clickhouse-server start
clickhouse-client

The above should leave you logged in to the ClickHouse client. You can exit and continue with the guide by typing exit.
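If you just want to confirm the server is up without staying in an interactive session, clickhouse-client can also run a single query and exit:

clickhouse-client --query "SELECT version()"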

Now we need to download the data for our test. Compressed, this is about 15 GB. You can get it from the official repository:

# Download the prepared partitions (about 15 GB compressed)
curl -O https://clickhouse-datasets.s3.yandex.net/ontime/partitions/ontime.tar
# Extract them into ClickHouse's data directory, then restart so the table is attached
sudo tar xvf ontime.tar -C /var/lib/clickhouse
sudo service clickhouse-server restart
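To verify that the data was picked up after the restart, list the tables in the datasets database; you should see ontime in the output:

clickhouse-client --query "SHOW TABLES FROM datasets"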

After this we can open the client and begin to use ClickHouse:

$ clickhouse-client

ClickHouse client version 19.13.3.26 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.13.3 revision 54425.

test-clickhouse :)

Running a Query

test-clickhouse :) select count() from datasets.ontime;

SELECT count()
FROM datasets.ontime

┌───count()─┐
│ 183953732 │
└───────────┘

1 rows in set. Elapsed: 0.376 sec. Processed 183.95 million rows, 1.66 GB (488.84 million rows/s., 4.40 GB/s.)

Not bad! 183.95 million rows in 0.376 seconds. Now for some more complex queries.
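Before writing them, it helps to know which columns are available. ClickHouse supports the standard DESCRIBE statement, which lists the table's columns and types (Year, DayOfWeek, OriginCityName, DestCityName and DepDelay are among the ones used below):

DESCRIBE TABLE datasets.ontime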

The most popular routes in 2015

test-clickhouse :) SELECT
:-]     OriginCityName,
:-]     DestCityName,
:-]     count(*) AS flights,
:-]     bar(flights, 0, 20000, 40)
:-] FROM datasets.ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM datasets.ontime
WHERE Year = 2015
GROUP BY
    OriginCityName,
    DestCityName
ORDER BY flights DESC
LIMIT 20

┌─OriginCityName────┬─DestCityName──────┬─flights─┬─bar(count(), 0, 20000, 40)──────┐
│ San Francisco, CA │ Los Angeles, CA   │   15116 │ ██████████████████████████████  │
│ Los Angeles, CA   │ San Francisco, CA │   14799 │ █████████████████████████████▌  │
│ New York, NY      │ Chicago, IL       │   14734 │ █████████████████████████████▍ │
│ Chicago, IL       │ New York, NY      │   14632 │ █████████████████████████████▎ │
│ New York, NY      │ Boston, MA        │   13201 │ ██████████████████████████▍    │
│ Boston, MA        │ New York, NY      │   13201 │ ██████████████████████████▍    │
│ New York, NY      │ Los Angeles, CA   │   13113 │ ██████████████████████████      │
│ Los Angeles, CA   │ New York, NY      │   13106 │ ██████████████████████████▏    │ 
│ Chicago, IL       │ Washington, DC    │   12509 │ █████████████████████████       │
│ Washington, DC    │ Chicago, IL       │   12310 │ ████████████████████████▌       │
│ Atlanta, GA       │ Chicago, IL       │   12213 │ ████████████████████████▍      │
│ Chicago, IL       │ Atlanta, GA       │   12103 │ ████████████████████████▏      │
│ Los Angeles, CA   │ Chicago, IL       │   11111 │ ██████████████████████▏        │
│ Atlanta, GA       │ New York, NY      │   11004 │ ██████████████████████          │
│ New York, NY      │ Atlanta, GA       │   10986 │ █████████████████████▊         │
│ Miami, FL         │ New York, NY      │   10790 │ █████████████████████▌          │
│ New York, NY      │ Miami, FL         │   10779 │ █████████████████████▌          │
│ Chicago, IL       │ Los Angeles, CA   │   10755 │ █████████████████████▌          │
│ Las Vegas, NV     │ Los Angeles, CA   │   10657 │ █████████████████████▎         │
│ Boston, MA        │ Washington, DC    │   10655 │ █████████████████████▎         │
└───────────────────┴───────────────────┴─────────┴─────────────────────────────────┘

20 rows in set. Elapsed: 0.158 sec. Processed 7.19 million rows, 331.68 MB (45.42 million rows/s., 2.10 GB/s.)
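Note that this query ranks origin-destination pairs, i.e. routes. If you want the most popular destinations regardless of origin, a minimal variation simply drops the origin from the grouping:

SELECT
    DestCityName,
    count(*) AS flights
FROM datasets.ontime
WHERE Year = 2015
GROUP BY DestCityName
ORDER BY flights DESC
LIMIT 10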

Flight delay dependence on the day of week

test-clickhouse :) SELECT DayOfWeek, count() AS c, avg(DepDelay >  60) AS delays FROM datasets.ontime GROUP BY DayOfWeek ORDER BY DayOfWeek

SELECT
    DayOfWeek,
    count() AS c,
    avg(DepDelay > 60) AS delays
FROM datasets.ontime
GROUP BY DayOfWeek
ORDER BY DayOfWeek ASC

┌─DayOfWeek─┬────────c─┬───────────────delays─┐
│         1 │ 27128506 │  0.04550659737768088 │
│         2 │ 26814992 │  0.03956365155730794 │
│         3 │ 26953717 │   0.0421657243043696 │
│         4 │ 27064257 │  0.04909072508437974 │
│         5 │ 27109167 │  0.05207943128610333 │
│         6 │ 23255555 │  0.03631652738453243 │
│         7 │ 25627538 │ 0.044403250909236774 │
└───────────┴──────────┴──────────────────────┘

7 rows in set. Elapsed: 0.509 sec. Processed 183.95 million rows, 919.77 MB (361.17 million rows/s., 1.81 GB/s.)
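The same pattern extends to other time dimensions. Assuming the data set's Month column (part of the standard ontime schema), this sketch computes the same hour-plus delay ratio by month:

SELECT
    Month,
    count() AS c,
    avg(DepDelay > 60) AS delays
FROM datasets.ontime
GROUP BY Month
ORDER BY Month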

Now for the last example query: cities of departure with the most frequent delays of an hour or longer.

test-clickhouse :) SELECT OriginCityName, count() AS c, avg(DepDelay >  60) AS delays FROM datasets.ontime GROUP BY OriginCityName HAVING c >  100000 ORDER BY delays DESC LIMIT 20

SELECT
    OriginCityName,
    count() AS c,
    avg(DepDelay > 60) AS delays
FROM datasets.ontime
GROUP BY OriginCityName
HAVING c > 100000
ORDER BY delays DESC
LIMIT 20

┌─OriginCityName──────┬────────c─┬───────────────delays─┐
│ Fayetteville, AR    │   195768 │  0.06756466838298394 │
│ Newark, NJ          │  3867059 │  0.06679804988752434 │
│ Chicago, IL         │ 11572746 │   0.0620933873429867 │
│ San Francisco, CA   │  4319419 │  0.06055027308070831 │
│ Eugene, OR          │   118873 │  0.05775912107879838 │
│ White Plains, NY    │   210879 │ 0.056572726539864095 │
│ Santa Barbara, CA   │   206182 │  0.05648892725844157 │
│ New York, NY        │  5716750 │  0.05643696156032711 │
│ Springfield, MO     │   147904 │  0.05575914106447425 │
│ Burlington, VT      │   160169 │  0.05564122895191954 │
│ Miami, FL           │  2187791 │  0.05382918203795518 │
│ Asheville, NC       │   100875 │  0.05364064436183395 │
│ Philadelphia, PA    │  2981859 │ 0.052901562414587676 │
│ Monterey, CA        │   112771 │ 0.052646513731367105 │
│ Fort Lauderdale, FL │  1734414 │ 0.052616042075306124 │
│ Boston, MA          │  3439462 │    0.051617084299812 │
│ Akron, OH           │   155093 │  0.05146589465675434 │
│ Columbia, SC        │   234073 │  0.05120197545210255 │
│ Valparaiso, FL      │   116229 │  0.05107159142726858 │
│ Portland, ME        │   227220 │ 0.050528122524425664 │
└─────────────────────┴──────────┴──────────────────────┘

20 rows in set. Elapsed: 1.678 sec. Processed 183.95 million rows, 4.81 GB (109.61 million rows/s., 2.87 GB/s.)

As you can see, ClickHouse handles this work admirably: on an instance with 8 GB of RAM and 4 CPUs, the response times are very good over 183,953,732 records. If you want to run more queries, there are more examples in the official tutorial.
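If you want to take results out of ClickHouse for further processing, the client can also emit them in other formats. For example, appending FORMAT CSV and redirecting to a file (delays.csv is just an illustrative name):

clickhouse-client --query "SELECT DayOfWeek, count() FROM datasets.ontime GROUP BY DayOfWeek FORMAT CSV" > delays.csv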

Conclusion

I hope this has sparked your interest in ClickHouse. I personally think it is a very good solution for log analysis and data processing. In a future guide I hope to show you the performance improvements that can be achieved by assembling a cluster to run these queries.

Also, ClickHouse exposes HTTP and native interfaces with client libraries for a whole variety of languages (Python, PHP, Go, C++, Ruby, etc.), so you can include its powerful capabilities in your workflow very easily.
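In fact, you don't need a client library at all to try this: the HTTP interface listens on port 8123 by default, and you can send a query straight from the shell:

curl 'http://localhost:8123/' --data-binary 'SELECT count() FROM datasets.ontime'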

Next steps:

  • Read Roman Leventov's comparison post on the performance of various open-source OLAP systems, including ClickHouse
  • Follow Civo on Twitter!