Apache Storm CheatSheet
Introduction
Step 1: Comparison to Hadoop
- In Hadoop, jobs finish (usually and hopefully :). In Storm, they keep running and continuously ingest events.
- In Hadoop you have the jobtracker. In Storm you have nimbus, which assigns tasks to machines, monitors failures, etc.
- Each node runs a supervisor. It listens for work and starts and stops workers to do some job, based on what nimbus assigned to it.
- A worker runs a subset of a topology; it could run multiple bolts, for example.
- zookeeper coordinates nimbus and the supervisors.
- nimbus and the supervisors are stateless: you can kill nimbus or a supervisor and it will come back as if nothing had happened.
Step 2: Topologies
A topology is a graph of computation. To run a topology:
- package all your code in a jar
- run the jar:
  storm jar mycode.jar MyMainTopology arg1 arg2
- You can use any programming language to create and submit topologies, because topology definitions are thrift structures and nimbus is a thrift service.
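To make "a graph of computation" concrete, here is a minimal sketch in plain Java. It is a toy model, not Storm's API: the class TopologyGraphSketch, its methods, and the node names (sentences, split, count, log) are all made up for illustration. It only shows that a topology is a set of nodes plus "who consumes whose output" edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a topology as a directed graph. Hypothetical names, not Storm's API.
class TopologyGraphSketch {
    // node name -> names of the nodes that consume its output
    private final Map<String, List<String>> subscribers = new LinkedHashMap<>();

    // Declare that `to` consumes the stream emitted by `from`.
    void subscribe(String from, String to) {
        subscribers.putIfAbsent(from, new ArrayList<>());
        subscribers.putIfAbsent(to, new ArrayList<>());
        subscribers.get(from).add(to);
    }

    // Every node reachable from `source`, in breadth-first order.
    List<String> downstreamOf(String source) {
        List<String> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(source);
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : subscribers.getOrDefault(node, List.of())) {
                if (visited.add(next)) {
                    result.add(next);
                    queue.add(next);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TopologyGraphSketch topology = new TopologyGraphSketch();
        topology.subscribe("sentences", "split");
        topology.subscribe("split", "count");
        topology.subscribe("split", "log");
        System.out.println(topology.downstreamOf("sentences"));
    }
}
```

In a real topology the nodes are spouts and bolts, and the edges are declared when the topology is built.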
Step 3: Spouts and bolts
- Spouts read a stream from an external source (for example, they register to some stream) and emit it into the topology.
- Bolts do computations. They can emit one stream or multiple streams, talk to a database, etc.
- When a bolt emits a stream, it is received by all bolts that registered to that stream.
- You can keep state in memory inside your bolt. This is less recommended, but you can do it. If the bolt is restarted, its memory is reset.
- When you store something in memory, better use fieldsGrouping so that the same keys reach the same bolt instance. In shuffleGrouping the spread is random.
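The difference between the two groupings can be sketched in plain Java. This is only an illustration of the idea (hash the key and take it modulo the number of tasks, versus picking a task at random); it is not Storm's implementation, and the class and method names below are hypothetical.

```java
import java.util.Objects;
import java.util.Random;

// Sketch of how a grouping could pick a target task. Not Storm's code.
class GroupingSketch {
    // Fields grouping idea: the task depends only on the key,
    // so equal keys always land on the same task.
    static int fieldsGrouping(Object key, int numTasks) {
        return Math.floorMod(Objects.hashCode(key), numTasks);
    }

    // Shuffle grouping idea: the task is picked without looking at the key.
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random random = new Random();
        for (String word : new String[] {"storm", "bolt", "storm", "storm"}) {
            System.out.println(word
                + " -> fields task " + fieldsGrouping(word, numTasks)
                + ", shuffle task " + shuffleGrouping(random, numTasks));
        }
    }
}
```

With the fields variant, an in-memory counter for "storm" lives in exactly one task. With the shuffle variant, the counts for "storm" may end up split across several tasks.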
Step 4: Parallelism
- You can define how much parallelism you want for each node, like the parallelism for a specific bolt, and Storm will spawn that number of threads across the cluster.
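As a rough analogy for what a parallelism setting means, the sketch below runs the same piece of work on a fixed number of threads in a single JVM. It is plain Java with hypothetical names (ParallelismSketch, run), not Storm's API, and unlike Storm it does not spread the threads across machines.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Analogy only: one "bolt" executed by `parallelism` threads in one JVM.
class ParallelismSketch {
    // Processes `tuples` units of work and returns thread name -> tuples handled.
    static Map<String, Integer> run(int tuples, int parallelism) {
        ExecutorService executors = Executors.newFixedThreadPool(parallelism);
        Map<String, Integer> handledBy = new ConcurrentHashMap<>();
        for (int i = 0; i < tuples; i++) {
            // The "bolt" work: record which thread handled this tuple.
            executors.execute(() ->
                handledBy.merge(Thread.currentThread().getName(), 1, Integer::sum));
        }
        executors.shutdown();
        try {
            executors.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handledBy;
    }

    public static void main(String[] args) {
        // At most 4 distinct thread names appear; the counts add up to 100.
        System.out.println(run(100, 4));
    }
}
```

How the tuples are split between the threads varies from run to run, so the exact counts per thread are not predictable.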