Tuesday, January 27, 2015

Debugging Real Time Apps



Debug   an application is usually a complicated task. It’s easy to correct the most simple errors, but there are some errors that are context sensitive and tend to require a lot more dedication. Fortunately, we have very useful tools (debuggers) to fight such errors. By setting breakpoints in the appropriate lines and evaluate certain conditions, we will be able to find and correct any mistakes we may have made in our application.

This scenario becomes more complex when we move from a mono application thread to a multi-thread one. We must bear in mind that multiple thread of execution can run in parallel, so we have to know when to stop each thread. If we don’t do this properly, we can observe that the context may change because some thread can continue running while watching. At this point, debuggers remain powerful tools, but we have to know the flow of application execution to stop everything that we observe.

But it can still be more dangerous ... and our game is a perfect example. As just explained, multi-threaded applications can be tested if we know how the application works. We can do this because we can, at any given time, freeze the application and assess certain information to be monitored ... but what if we can not freeze? Or better yet ... what if by stopping our application we  change the context that we are evaluating ?? (Heisenberg uncertainty principle applied to software). Our observation affectschanges the context. By having an independent clock controlling our application, any action that may affect this clock also affects the application itself.


Theoretically, but almost impossible to do without affecting the clock, we could stop all client and server threads at a time. The problem is we cannot resume these threads at the same time without affecting the entire life cycle. If a client thread is executed before the server, the movements of customers arrive sooner than expected. At the other side,  the movement arrives late, the server discards and force the customer to reverse the movement (if necessary). If our application to operate with high turn time (several minutes) we still  could handle this scenario using the technique just explained (stop all threads and resume at the same time). But is not the case: our tick time are always below 1 second and normally around 250 ms ... so you can not debug your applications with our beloved debuggers.

Then? How can we correct our application errors? Based on our experience, there are two approaches that can be combined to try to achieve our goal of detecting and monitoring what happens within the code.

    1. Traces within the code: Yes! The origin of species. The beginning of everything. Who has not ever printed a message? It is a basic strategy but it still works. We just need to add some metadata to the message and we will be able to track your application: We have to write the identifier of the thread that paints the message (we know the author) and when the event occurred every time we write something . In this way we will know who (thread), when (date and time) and what (detailed message) has happened.

    2. Specific monitoring tool: We have already  seen that we cannot analyze multithread application context  in real time. Once we evaluate something, this has changed. But what if the picture is taken within our application? Let’s Develop a tool to take pictures of the entire context for us and then analyze the data collected. We need a probe! By far it is the best approach. Of course, we must develop specific code to achieve this, but once developed we will know the internal behavior and what happens within our application. In the end, it will save time and it let us to properly understand any problem that we could detect. In our case, we have developed a tool to control what happens on the server.
We Collect movements coming from each remote client and analyzes if they are late, their increased latency, state, etc ... we will see in future posts that this tool will help us find various ways to improve the performance of our applications.

Here is an example of the output of our internal tool.

GAME 0
Frequency: 300   MaxLag:0  Offset:90
TURN 0
--------
Tick    StartedDate    Status Player 0     Mov Received Player 0 Advance Player 0              Inc advance Player 0  Next Sleep Player 0   Lag PLayer 0          Status Player 1     Mov Received Player 1 Advance Player 1              Inc advance Player 1  Next Sleep Player 1   Lag PLayer 1         
0       09:44:50.991   ON TIME             09:44:50.880          111                                                                                             ON TIME             09:44:50.896          95                                                                                              

1       09:44:51.308   ON TIME             09:44:51.210          98                            -13                   304.0                 0                     ON TIME             09:44:51.210          98                            3                     304.0                 0                    

2       09:44:51.626   ON TIME             09:44:51.527          99                            1                     304.5                 0                     ON TIME             09:44:51.528          98                            0                     304.0                 0                    

3       09:44:51.904   ON TIME             09:44:51.832          72                            -27                   291.0                 0                     ON TIME             09:44:51.832          72                            -26                   291.0                 0                    

4       09:44:52.223   ON TIME             09:44:52.124          99                            27                    304.5                 0                     ON TIME             09:44:52.102          121                           49                    315.5                 0                    

No comments:

Post a Comment