Monitoring Reactive Apps
with Kamon



Presented by Ivan Topolnjak / @ivantopo

The Kamon Project


  • Open source project available on GitHub.
  • Provides production-quality monitoring tools for
    applications built on top of Akka, Spray and Play!
  • Has been under development for over a year.
  • Typesafe technology partners since May 2014.
  • Already being used in production.

Motivations


  • Traditional monitoring tools don't play well
    with the reactive model.
  • No open source options for monitoring
    reactive applications.
  • Only one commercial offering for monitoring
    reactive applications.


We wanted to help people succeed when building reactive applications,
and production monitoring is a top priority for real success.

Why production monitoring?



There is a gap between your mental model, your development environment and your production environment.

Suggested material: "Metrics, Metrics Everywhere" by Coda Hale.

How to do it right?


  • Start with high-level metrics, like user-experienced response time.
    • How long does a login take?
    • How long is the user waiting for search results?

  • Go a bit deeper and analyze sections of functionality within your app.
    • How long is the external HTTP authentication service taking?
    • How long did the "select all products" JDBC call take?

  • Go even deeper and analyze the core components of your app.
    • How many messages is this actor handling?
    • How big is its mailbox?
    • How many active threads do we have in this dispatcher?

The Traditional Model





There is a dedicated thread per request. When you ask for an external resource, your thread sits there doing nothing, waiting for the resource it needs to continue processing.

The Enhanced Traditional Model





Some pieces of business logic might be submitted to worker threads, but there is still a dedicated thread waiting for the results.

The Reactive Model




All processing stages happen asynchronously, as a reaction to the previous stage's completion. When a thread is not being used, it is free to collaborate in other parts of the application.






Tool Mapping






Traditional Model: ThreadLocal



Enhanced Traditional Model:
ThreadLocal + Utils



Reactive Model: ???



Enter the TraceContext



  • It's attached to events as they flow through the system.
  • Has a name.
  • Has a token.
  • Can contain segments.

TraceContext propagation: Actors



  • The TraceContext available when sending a message is also available when (and only when) processing that message in the target actor.
    
      actorRef ! "some message"
      actorRef ? "some question"
      pipe(someFuture) to actorRef
                  

  • When an actor fails and the supervision mechanism kicks in, the related system messages carry the TraceContext too.
    
      def receive = {
        case "fail" => 1/0
      }
                  

TraceContext propagation: Futures



  • The TraceContext available when creating a future is also available when the future's body is executing.
    
      val lottoNumber = Future {
        expensivePredictiveAnalysis()
      }
                  

  • The TraceContext available when creating a future is also available when executing any callbacks on the future.
    
      lottoNumber.map(verifyAuthenticity).map(sendToMe)
                  

Simple TraceContext API



  • Wrap the code where your business transaction starts like this:
    
      TraceRecorder.withNewTraceContext("load-user-data") {
        userService ! UserService.Find(username)
      }
                  

  • Identify where your business transaction ends and finish the trace.
    
      TraceRecorder.finish()
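
Putting the two calls together, here is a minimal sketch assuming the Kamon 0.x-era TraceRecorder API shown above (import path kamon.trace.TraceRecorder assumed); everything except the two TraceRecorder calls, including the UserService actor, is illustrative:

      import akka.actor.{Actor, ActorSystem, Props}
      import kamon.trace.TraceRecorder

      // Hypothetical service actor: UserService and Find are illustrative,
      // they are not part of Kamon.
      object UserService {
        case class Find(username: String)
      }

      class UserService extends Actor {
        def receive = {
          case UserService.Find(username) =>
            // ... load the user data ...
            // The TraceContext started by the sender is available here,
            // so this is where the "load-user-data" trace is finished.
            TraceRecorder.finish()
        }
      }

      object LoadUserDataExample extends App {
        val system = ActorSystem("example")
        val userService = system.actorOf(Props[UserService], "user-service")

        // The business transaction starts here, so wrap the send.
        TraceRecorder.withNewTraceContext("load-user-data") {
          userService ! UserService.Find("ivantopo")
        }
      }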
                  

Reactive Model: TraceContext



Let's measure!

What is important for you?


  • Mean.
  • Median.
  • Maximum.
  • Standard Deviation.
  • Percentiles.
    • 75 %
    • 90 %
    • 95 %
    • 99 %
    • 99.99 %



The answer is simple:


We don't know!






We will keep all the data and let you decide.
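
As a toy illustration (made-up numbers) of why keeping everything matters: with a latency distribution where 2% of requests are slow, the mean looks healthy while the 99th percentile exposes the pain.

      // Made-up latencies: 980 fast requests at 10 ms, 20 slow ones at 2000 ms.
      object MeanVsPercentile extends App {
        val latenciesMs = Seq.fill(980)(10.0) ++ Seq.fill(20)(2000.0)

        val mean = latenciesMs.sum / latenciesMs.size
        val p99Index = math.ceil(latenciesMs.size * 0.99).toInt - 1
        val p99 = latenciesMs.sorted.apply(p99Index)

        println(f"mean = $mean%.1f ms") // 49.8 ms, looks perfectly healthy
        println(f"p99  = $p99%.1f ms")  // 2000.0 ms, 2% of users wait two seconds
      }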






The challenge


  • Store millions of measurements per second.
  • Have a limited and predictable memory footprint.
  • Very low overhead for recording measurements.





Is it even possible?











Yes! Thanks to the
HDR Histogram






HDR Histogram Layout





  • The underlying data structure is allocated only once.
  • Each bucket counts occurrences of the corresponding value.
  • There are no allocations when recording a measurement.
  • Very low recording overhead: 3-6 nanoseconds on
    modern (circa 2012) Intel CPUs.
  • Created by Gil Tene, source code on GitHub.
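
A quick sketch of the HdrHistogram library itself (the org.hdrhistogram:HdrHistogram artifact), recording simulated latencies and reading back the kinds of percentiles discussed earlier; the value range and latencies are made up:

      import org.HdrHistogram.Histogram
      import scala.util.Random

      object HdrHistogramExample extends App {
        // Track values from 1 ns up to 1 hour (3,600,000,000,000 ns) with
        // 3 significant decimal digits. The backing array is allocated once, here.
        val histogram = new Histogram(3600000000000L, 3)

        // Recording is an index computation plus a counter increment:
        // no allocations happen on this path.
        (1 to 1000000).foreach { _ =>
          val simulatedLatencyNanos = 100000L + Random.nextInt(5000000)
          histogram.recordValue(simulatedLatencyNanos)
        }

        println(s"mean   = ${histogram.getMean} ns")
        println(s"max    = ${histogram.getMaxValue} ns")
        println(s"p99    = ${histogram.getValueAtPercentile(99.0)} ns")
        println(s"p99.99 = ${histogram.getValueAtPercentile(99.99)} ns")
      }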





Let's go deeper with segments






Trace Segments




  • A piece of functionality that happens inside a trace.
  • Can be used from different traces.

Simple Segments API



  • Start a segment from any point in your application using the TraceRecorder:
    
      val segmentCompletionHandle = TraceRecorder.startSegment(HttpClientRequest("google"))
                  

  • Finish the segment and metrics will be generated for the segment too!
    
      segmentCompletionHandle.finish()
                  





Let's go even deeper!






Actor System Metrics



  • Actor Metrics:
    • processing-time: Time taken by the actor to process each message
    • mailbox-size: Size of the mailbox, per actor
    • time-in-mailbox: Time elapsed from when a message is put in the mailbox until its processing starts

  • Dispatcher Metrics:
    • queue-size: Dispatcher queue size
    • active-threads: Number of active threads

User Metrics



Simple and useful instruments based on the same
cool stuff we already saw:

  • Counters
  • Histograms
  • Gauges



All of them at your service, for whatever use you want.
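
These are Kamon instruments, so the exact registration calls depend on the Kamon version in use; rather than guess at that API, here is a small self-contained illustration of what each instrument records, built on the same HdrHistogram we just saw (the metric names and values are made up):

      import java.util.concurrent.atomic.AtomicLong
      import org.HdrHistogram.Histogram

      object InstrumentsIllustration extends App {
        // Counter: a monotonically increasing count of events.
        val rejectedLogins = new AtomicLong(0)
        rejectedLogins.incrementAndGet()

        // Histogram: the full distribution of recorded values, nothing averaged away.
        val searchResultsReturned = new Histogram(10000L, 2)
        searchResultsReturned.recordValue(42)

        // Gauge: samples a current value (e.g. a queue depth) at a fixed interval
        // and records the samples into a histogram.
        def currentQueueDepth(): Long = 7 // hypothetical reading
        val queueDepth = new Histogram(1000000L, 2)
        queueDepth.recordValue(currentQueueDepth())

        println(s"rejected logins so far: ${rejectedLogins.get}")
        println(s"p99 search results:     ${searchResultsReturned.getValueAtPercentile(99.0)}")
      }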






Tools for fun and profit






Trace Token Logging



  22:24:07.197 INFO  [undefined] Upper casing [Hello without context]
  22:24:07.198 INFO  [undefined] Upper casing [Hello without context]
  22:24:07.198 INFO  [undefined] Calculating the length of: [HELLO WITHOUT CONTEXT]
  22:24:07.199 INFO  [undefined] Upper casing [Hello without context]
  22:24:07.200 INFO  [undefined] Upper casing [Hello without context]
  22:24:07.200 INFO  [undefined] Upper casing [Hello without context]
  22:24:07.204 INFO  [undefined] Calculating the length of: [HELLO WITHOUT CONTEXT]
  22:24:07.205 INFO  [undefined] Calculating the length of: [HELLO WITHOUT CONTEXT]
  22:24:07.205 INFO  [undefined] Calculating the length of: [HELLO WITHOUT CONTEXT]
  22:24:07.206 INFO  [undefined] Calculating the length of: [HELLO WITHOUT CONTEXT]
          

No monitoring info: a logging nightmare!

Trace Token Logging



  22:24:07.197 INFO  [localhost-1] Upper casing [Hello with context]
  22:24:07.198 INFO  [localhost-2] Upper casing [Hello with context]
  22:24:07.198 INFO  [localhost-1] Calculating the length of: [HELLO WITH CONTEXT]
  22:24:07.199 INFO  [localhost-3] Upper casing [Hello with context]
  22:24:07.200 INFO  [localhost-4] Upper casing [Hello with context]
  22:24:07.200 INFO  [localhost-5] Upper casing [Hello with context]
  22:24:07.204 INFO  [localhost-2] Calculating the length of: [HELLO WITH CONTEXT]
  22:24:07.205 INFO  [localhost-3] Calculating the length of: [HELLO WITH CONTEXT]
  22:24:07.205 INFO  [localhost-4] Calculating the length of: [HELLO WITH CONTEXT]
  22:24:07.206 INFO  [localhost-5] Calculating the length of: [HELLO WITH CONTEXT]
          

TraceToken in logs: everything makes sense! (and is grep-friendly)

Automatic Trace Token Propagation



Automatic Trace Token Propagation


Trace Token Logging

  22:24:07.197 INFO  [search-host-01-54820] Searching cheapest hotels for: 'Zagreb'
            

+

Automatic Trace Token Propagation

  X-Trace-Token: "search-host-01-54820"
            

+

Log aggregator (Splunk, Elasticsearch) =


Distributed Trace

Spray Integration


  • Automatic trace start and finish.
  • Automatic segment recognition for
    spray-client HTTP requests.
  • Automatic trace token propagation.

Play! Integration


  • Automatic trace start and finish.
  • Automatic segment recognition for
    WS HTTP requests.
  • Automatic trace token propagation.
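
For reference, a hedged sketch of how the integrations mentioned so far were typically pulled into an sbt build in the Kamon 0.x days; the artifact names match the modules Kamon published back then, but the version is a placeholder and the byte-code instrumentation also required starting the JVM with the AspectJ weaver agent:

      // build.sbt (sketch; replace the version with the Kamon release you use)
      val kamonVersion = "0.x"  // placeholder, not a real version number

      libraryDependencies ++= Seq(
        "io.kamon" %% "kamon-core"  % kamonVersion,
        "io.kamon" %% "kamon-spray" % kamonVersion, // Spray integration
        "io.kamon" %% "kamon-play"  % kamonVersion  // Play! integration
      )

      // Kamon 0.x relies on AspectJ instrumentation, so the application must
      // run with something like: -javaagent:<path-to>/aspectjweaver.jar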





Metric Backends






StatsD


  • Actor System Metrics.
  • Trace Metrics.
  • Segment Metrics.
  • Endless integration possibilities.
  • Docker image with predefined dashboard.

StatsD

StatsD


New Relic


  • Trace Metrics.
  • More metrics will come shortly.

New Relic







Questions?