Apache Storm CheatSheet
Introduction
Step 1: Comparison to Hadoop
- In Hadoop, jobs finish (usually, and hopefully :). In Storm, they continuously ingest events and keep running until you kill them.
- In Hadoop you have the JobTracker. In Storm you have nimbus, which assigns tasks to machines, monitors failures, etc.
- Each node runs a supervisor. It listens for work and starts and stops workers to do some job, based on what nimbus assigned to it.
- A worker runs a subset of a topology; it could run multiple bolts, for example.
- zookeeper coordinates nimbus and the supervisors.
- nimbus and the supervisors are stateless: you can kill nimbus or a supervisor and it will come back like nothing has happened.
Step 2: Topologies
A graph of computation. To run a topology:
- Package all your code in a jar.
- Run the jar:
  storm jar mycode.jar MyMainTopology arg1 arg2
- You can use any programming language, as all communication, input, and output are in Thrift.
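To make "a graph of computation" concrete, here is a minimal in-process sketch in plain Java. It is not Storm code: ToyTopology, addBolt, and emit are hypothetical names invented for this illustration, and everything runs in a single thread. It only models the idea that components are nodes, subscriptions are edges, and a value emitted by one node is delivered to every node subscribed to it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: a toy, single-threaded model of a topology graph.
// None of these names are part of the Storm API.
final class ToyTopology {
    // bolt id -> logic turning one input value into zero or more output values
    private final Map<String, Function<String, List<String>>> bolts = new LinkedHashMap<>();
    // source component id -> ids of the bolts subscribed to its stream
    private final Map<String, List<String>> subscribers = new LinkedHashMap<>();
    // bolt id -> every value that bolt has received, kept for inspection
    final Map<String, List<String>> received = new LinkedHashMap<>();

    // Register a bolt and subscribe it to the stream of another component.
    void addBolt(String id, String subscribesTo, Function<String, List<String>> logic) {
        bolts.put(id, logic);
        subscribers.computeIfAbsent(subscribesTo, k -> new ArrayList<>()).add(id);
        received.put(id, new ArrayList<>());
    }

    // Deliver a value emitted by `source` to every subscribed bolt, then
    // pass on whatever those bolts emit. Assumes the graph has no cycles.
    void emit(String source, String value) {
        for (String boltId : subscribers.getOrDefault(source, Collections.emptyList())) {
            received.get(boltId).add(value);
            for (String out : bolts.get(boltId).apply(value)) {
                emit(boltId, out);
            }
        }
    }
}
```

For example, with a "split" bolt and an "audit" bolt both subscribed to a "sentences" source, and a "count" bolt subscribed to "split", calling emit("sentences", "hello storm") hands the sentence to both "split" and "audit", and hands each word produced by "split" to "count".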
Step 3: Spouts and bolts
- Spouts are the sources of streams: they read from an external source (for example, they register to some feed) and emit it as a stream.
- Bolts do computations. They can emit one or multiple streams, talk to a database, etc.
- When a bolt emits a stream, it is received by all bolts that registered to that stream.
- You can keep state in a bolt's memory. That is less recommended, but you can do it. If the bolt is restarted, its in-memory state is reset.
- When you store something in memory, better use fields grouping so the same keys reach the same bolts. With shuffle grouping the spread is random.
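The difference between the two groupings can be sketched in a few lines of plain Java. This is a simplified model, not Storm's actual routing code: GroupingSketch and both method names are made up for the illustration, and hashing the key modulo the number of tasks is an assumption that merely captures the "same key, same bolt" property.

```java
import java.util.Random;

// Illustration only: how a tuple could be routed to one of N bolt tasks.
// A simplified model, not Storm's implementation.
final class GroupingSketch {
    // Fields grouping: the target task is a deterministic function of the key,
    // so tuples with equal keys always land on the same task.
    static int fieldsGrouping(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Shuffle grouping: the target task is picked at random and ignores the key,
    // so tuples with equal keys may land on different tasks.
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }
}
```

This is why in-memory state such as a per-key counter only makes sense with fields grouping: under shuffle grouping, the counts for one key would be scattered over several bolt instances.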
Step 4: Parallelism
- You can define how much parallelism you want for each node of the topology, and Storm will spawn that number of threads across the cluster, for example the parallelism for a specific bolt.
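As a rough picture of what a parallelism of N means for one bolt, the sketch below runs N single-threaded "tasks" inside one JVM, each with its own private word counts. It is plain Java, not Storm: ParallelismSketch and countWords are hypothetical names, the threads all live in one process rather than being spread across a cluster, and routing by key hash is an assumed stand-in for a fields-style grouping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: N threads, each acting as one instance of the same bolt.
final class ParallelismSketch {
    // Counts words using `parallelism` tasks and returns each task's private state.
    static List<Map<String, Integer>> countWords(List<String> words, int parallelism) {
        List<Map<String, Integer>> stateByTask = new ArrayList<>();
        List<ExecutorService> threads = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            stateByTask.add(new HashMap<>());                   // in-memory state of task i
            threads.add(Executors.newSingleThreadExecutor());   // the thread running task i
        }
        List<Future<?>> pending = new ArrayList<>();
        for (String word : words) {
            // Route by key so that each word is always counted by the same task.
            int task = Math.floorMod(word.hashCode(), parallelism);
            Map<String, Integer> state = stateByTask.get(task);
            pending.add(threads.get(task).submit(() -> { state.merge(word, 1, Integer::sum); }));
        }
        try {
            for (Future<?> f : pending) {
                f.get();                                        // wait for every tuple to be processed
            }
        } catch (Exception e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            for (ExecutorService thread : threads) {
                thread.shutdown();
            }
        }
        return stateByTask;
    }
}
```

Each map is only ever touched by its own thread, so no locking is needed. Raising the parallelism argument adds more threads and more independent copies of the state, which is the trade-off to keep in mind when a bolt holds data in memory.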