Druid..!!

Druid Components in Design

ㅁ When should I use Druid?

Insert rates are very high, but updates are less common.
( 사실 Insert 만 있어야함. update 는 안됌. 모든 Data 는 immutable 한 상태라 보는 것이 맞음 )
Most of your queries are aggregation and reporting queries (“group by” queries). You may also have searching and scanning queries.
group by, filtering 을 구현하기에 좋다
You are targeting query latencies of 100ms to a few seconds.
( 100ms ~ 수초 이내에 query response 가 요구될 때 )
Your data has a time component
( 이것도 필수임 )
You have high cardinality data columns (e.g. URLs, user IDs) and need fast counting and ranking over them.
( Counting, Ranking 같은 연산은 참 빠름 )
You want to load data from Kafka, HDFS, flat files, or object storage like Amazon S3. ( 나름 유명한 데이터 오픈소스와의 integration 을 잘 지원해줌 )

ㅁ 그렇다면 쓰지 말아야 할때는??

참 아이러니 하게도(?) Druid 는 쓰지 말아야 할때를 공식문서에 잘 정리해 두었다. ( 뭔가 믿음직 스러운… +_+ ㅎㅎ )
primary key 를 통한 update
query latency 가 중요한 reporting 시스템
“big” join 이 필요한 query ( 20만건 이내는 lookup 으로 해결 가능하긴 함, Infra 성능에 따라 다릅니다. )

ㅁ Druid’s main features

Columnar storage format
Scalable distributed system
Massively parallel processing
Realtime or batch ingestion
Self-healing, self-balancing, easy to operate
Cloud-native, fault-tolerant architecture that won’t lose data
Indexes for quick filtering
Approximate algorithms
Automatic summarization at ingest time

ㅁ Architecture

Overview

_config.yml

Storage

Segments

segment file’s data structure : Timestamp, Demensions, Metrics
Multi-value Columns 도 가능함 ( 여러개 들어가도 Bitmap indexing 이 가능하기 때문) ``` javascript 1: Dictionary that encodes column values { “Justin Bieber”: 0, “Ke$ha”: 1 }

2: Column data [0, [0,1], <–Row value of multi-value column can have array of values 1, 1]

3: Bitmaps - one for each unique value value=”Justin Bieber”: [1,1,0,0] value=”Ke$ha”: [0,1,1,1] ^ | | Multi-value column has multiple non-zero entries

<br/>
##### &nbsp;&nbsp; Data Structures
![_config.yml](/images/druid/data_structure.png) <br/>
 
 - 기본적으로 Timestamp, Dimensions, Metrics 가 있음
 - Dimensioons 컬럼에 의해 group-by, filter operation 이 일어남

<br/>
##### Segment Components
 - `version.bin` : 4bytes 로 버전을 나타냄
 - `meta.smoosh` : meta data ( 다른 smoosh 파일에 대한 )
 - `XXXXX.smoosh` : 데이터의 minimized 된 파일. 데이터의 각 열에 대한 개별 파일과 Segment 에 대한 extra metadata 가 있는 index.drd 파일이 있음

<br/>
##### Format of a column
```text
    1. A Jackson-serialized ColumnDescriptor
    2. The rest of the binary for the column

Sharding

Segment 의 form 은 interval 별 block 이다. 이 block 은 shardSpec에 의존하고 druid의 queries 는 이 block 이 완성되어야만 완료된다. ( 즉, block 이 만들어 지기 전까지는 해당 interval 에 대해선 query 가 안날라 간다는 의미인듯 함 )

//example block files
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_1
sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_2

ㅁ Nodes Types

Historical

Running
```
$ io.druid.cli.Main server historical
```
역할
- Segments 를 로딩하고 서빙하는 역할, 캐시에 저장해놓는 역할
- Segments 에 대해서 querying 하는 역할도 담당 ( groupBy, topN 등 쿼리의 경우 historical nodes 에서 부분적으로 먼저 처리하고 처리한 결과를 전달 )

Broker

역할
- query 라우팅 역할 ( 주키퍼를 통해 쿼리할 대상의 노드를 알아냄.)
- Historical Nodes 로부터 받은 결과를 merge 한다.
- Realtime Nodes 로부터도 결과를 같이 받아 merge 한다.
- LRU caching per-segment results

Coordinator

 $ io.druid.cli.Main server coordinator

HTTP endpoints supported : 리더 정보, 세그먼트 로딩, 데이터소스 정보 GET, 데이터 소스 POST, 리텐션 룰 GET 등
primarily responsible for segment management and distribution
주기적으로 돌면서 적절한 action 을 취함.
broker / historical nodes 와 비슷하게 zookeeper 와 연결되어 현재 클러스터의 정보를 얻는다.
또한 available segments 와 rules 정보를 위해 metastore 와도 connection을 맺고 있다.
역할 : Cleaning Up Segments, Historical Nodes 로 로딩되는 Segments 의 밸런싱을 담당.

Indexing Service

Overlord

역할 : Task 를 받고, Task 를 분배하는 역할을 한다. 그리고 caller 에게 Task 에 대한 결과를 return 해준다.

MiddleManager

io.druid.cli.Main server middleManager

역할 : submitted 된 task 를 처리하는 worker node다. peon 앞에서 task 를 전달해주는 역할을 한다.

ㅁ Dependencis

Deep Storage

Types : Local Mount, S3-compatible, HDFS, etc..

Metadata Storage

Types : derby ( for not production), MySQL, PostgreSQL

Metadata Storage Tables

Segments Table

used column : 사용 여부

payload column :

 {
  "dataSource":"wikipedia",
  "interval":"2012-05-23T00:00:00.000Z/2012-05-24T00:00:00.000Z",
  "version":"2012-05-24T00:10:00.046Z",
  "loadSpec":{
     "type":"s3_zip",
     "bucket":"bucket_for_segment",
     "key":"path/to/segment/on/s3"
  },
  "dimensions":"comma-delimited-list-of-dimension-names",
  "metrics":"comma-delimited-list-of-metric-names",
  "shardSpec":{"type":"none"},
  "binaryVersion":9,
  "size":size_of_segment,
  "identifier":"wikipedia_2012-05-23T00:00:00.000Z_2012-05-24T00:00:00.000Z_2012-05-23T00:10:00.046Z"
 }

Druid cluster 의 segement 관리는 MySQL 을 통해 저장된다. 그렇기에 Segments 를 Delete 후 Reload 할때에도 API 를 통해서도 되지만 MySQL Segment 테이블의 Used 를 1 로 하는 방법도 있다. ( 단!! 주의!! Overwirte 된 Segment 의 경우 version 이 높은 것만 골라서 1로 바꿔줘야 한다.)
Zookeeper

역할
- Coordinator leader election
- Segment “publishing” protocol from Historical and Realtime
- Segment load/drop protocol between Coordinator and Historical
- Overlord leader election
- Indexing Service task management

Written on September 20, 2018