Step 1: Comparison to Hadoop
Hadoop jobs finish (usually and hopefully :). In Storm they continuously ingest events.
- In Hadoop you have the JobTracker; in Storm you have Nimbus, assigning tasks to machines, monitoring failures, etc.
- Each node runs a supervisor; it listens for work and starts and stops workers to do some job based on what Nimbus assigned to it.
- Workers run a subset of a topology; could run multiple
- Nimbus and the supervisors are stateless: you can kill Nimbus or a supervisor and it will come back like nothing has happened.
Step 2: Topologies
A topology is a graph of computation. To run a topology:
- package all your code in a jar
- run the jar
storm jar mycode.jar MyMainTopology arg1 arg2
- You can use any programming language, as all communication and input and output are in
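To make the "graph of computation" idea concrete, here is a toy sketch in plain Python. It does not use Storm at all: the `Topology` class, the component names and the handlers are all made up for illustration, and it pushes a short finite list through the graph where a real topology would keep ingesting events.

```python
from collections import defaultdict, deque


class Topology:
    """Toy model of a topology: named components joined by directed edges."""

    def __init__(self):
        self.handlers = {}                    # component name -> function(value) -> list of outputs
        self.subscribers = defaultdict(list)  # component name -> names of downstream components

    def add(self, name, handler, inputs=()):
        """Register a component and subscribe it to each upstream component in `inputs`."""
        self.handlers[name] = handler
        for upstream in inputs:
            self.subscribers[upstream].append(name)

    def run(self, source, events):
        """Push each event through the graph and record what every component received."""
        seen = defaultdict(list)
        pending = deque((source, event) for event in events)
        while pending:
            name, value = pending.popleft()
            seen[name].append(value)
            for out in self.handlers[name](value):
                # Every subscriber of this component gets its own copy of the output.
                for downstream in self.subscribers[name]:
                    pending.append((downstream, out))
        return dict(seen)


topology = Topology()
topology.add("sentences", lambda s: [s])                          # source: passes events through
topology.add("split", lambda s: s.split(), inputs=["sentences"])  # one sentence -> many words
topology.add("upper", lambda w: [w.upper()], inputs=["split"])
topology.add("length", lambda w: [len(w)], inputs=["split"])

seen = topology.run("sentences", ["storm runs forever"])
print(seen["upper"])   # ['storm', 'runs', 'forever']
print(seen["length"])  # ['storm', 'runs', 'forever']
```

Both `upper` and `length` subscribe to `split`, so each of them receives every word that `split` emits.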
Step 3: Spouts and bolts
- Spouts read a stream; for example, they register to some stream.
- Bolts do computations, and can emit a stream or multiple streams, talk to
- When a bolt emits a stream, it is received by all bolts that registered to that stream.
- You can keep state in your bolt; that is less recommended, but you can do it. If the bolt is restarted, its memory is reset.
- When you store something in memory, better use fieldsGrouping so the same keys reach the same bolts. In shuffleGrouping the spread is random.
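The difference between the two groupings is easiest to see with in-memory counters. The sketch below is plain Python, not Storm code: `NUM_TASKS`, the routing functions and the word list are hypothetical, and hashing the key with `zlib.crc32` is just a stand-in for whatever Storm does internally. It is used here because Python's built-in `hash` for strings changes between runs.

```python
import random
import zlib
from collections import Counter

NUM_TASKS = 4  # hypothetical number of parallel instances of one bolt


def fields_grouping(key, num_tasks=NUM_TASKS):
    """Pick a task from a stable hash of the key, so equal keys always share a task."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def shuffle_grouping(rng, num_tasks=NUM_TASKS):
    """Pick a task at random, ignoring the key."""
    return rng.randrange(num_tasks)


def count_words(words, route):
    """Give every task its own in-memory Counter and send each word where `route` says."""
    state = [Counter() for _ in range(NUM_TASKS)]
    for word in words:
        state[route(word)][word] += 1
    return state


def tasks_holding(state, word):
    """Indexes of the tasks that have counted `word` at least once."""
    return [i for i, counter in enumerate(state) if counter[word] > 0]


words = ["storm", "bolt", "storm", "spout", "storm", "bolt"] * 50

by_field = count_words(words, fields_grouping)
rng = random.Random(0)  # fixed seed so the run is repeatable
by_shuffle = count_words(words, lambda word: shuffle_grouping(rng))

print(len(tasks_holding(by_field, "storm")))    # 1: one task holds the full count
print(len(tasks_holding(by_shuffle, "storm")))  # more than 1: the count is split up
```

With the key-based routing a single task ends up holding the complete count for "storm". With random routing every task holds only a partial count, so per-bolt memory is not enough to answer "how many times did we see this key".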
Step 4: Parallelism
- You can define how much parallelism you want for each node of the topology, and Storm will spawn that number of threads across the cluster, like parallelism for a specific
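As a rough picture of what "that number of threads" means, here is a single-process sketch in plain Python. It is only an analogy: `run_bolt`, its parameters and the round-robin spread are assumptions made for the example, and in Storm the threads would live in worker processes on different machines, not in one interpreter.

```python
import queue
import threading


def run_bolt(parallelism, items, work):
    """Run `work` over `items` with `parallelism` threads, one input queue per thread."""
    inboxes = [queue.Queue() for _ in range(parallelism)]
    results = [[] for _ in range(parallelism)]

    def executor(index):
        # Each thread reads only its own inbox and writes only its own result list.
        while True:
            item = inboxes[index].get()
            if item is None:  # sentinel: no more input for this thread
                break
            results[index].append(work(item))

    threads = [threading.Thread(target=executor, args=(i,)) for i in range(parallelism)]
    for thread in threads:
        thread.start()
    for n, item in enumerate(items):  # spread the input round-robin over the threads
        inboxes[n % parallelism].put(item)
    for inbox in inboxes:
        inbox.put(None)
    for thread in threads:
        thread.join()
    return results


results = run_bolt(parallelism=3, items=range(10), work=lambda x: x * x)
print(results[0])  # [0, 9, 36, 81]
```

Changing `parallelism` changes how many threads share the work without touching the `work` function itself, which is the point of declaring parallelism per node of the topology.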