Tuesday, January 27, 2015

Debugging Real Time Apps



Debugging an application is usually a complicated task. It’s easy to correct the simplest errors, but some errors are context-sensitive and require a lot more dedication. Fortunately, we have very useful tools (debuggers) to fight such errors. By setting breakpoints on the appropriate lines and evaluating certain conditions, we will be able to find and correct any mistakes we may have made in our application.

This scenario becomes more complex when we move from a single-threaded application to a multi-threaded one. We must bear in mind that multiple threads of execution can run in parallel, so we have to know when to stop each thread. If we don’t do this properly, the context may change while we are watching, because some threads keep running. At this point, debuggers remain powerful tools, but we have to know the application’s execution flow in order to stop everything we want to observe.

But it can still be more dangerous... and our game is a perfect example. As just explained, multi-threaded applications can be debugged if we know how the application works: at any given time we can freeze the application and evaluate the information we want to monitor... but what if we cannot freeze it? Or worse: what if, by stopping our application, we change the very context we are evaluating? (The Heisenberg uncertainty principle applied to software.) Our observation changes the context. Since an independent clock drives our application, any action that affects this clock also affects the application itself.


Theoretically, though it is almost impossible without affecting the clock, we could stop all client and server threads at once. The problem is that we cannot resume those threads at the same time without affecting the entire life cycle. If a client thread resumes before the server, the client’s movements arrive sooner than expected. If it resumes after the server, the movements arrive late, so the server discards them and forces the client to revert the movement (if necessary). If our application ran with a long turn time (several minutes) we could still handle this scenario using the technique just explained (stop all threads and resume them at the same time). But that is not our case: our tick time is always below one second, normally around 250 ms... so we cannot debug our application with our beloved debuggers.

So what now? How can we correct our application’s errors? Based on our experience, there are two approaches that can be combined to detect and monitor what happens within the code.

    1. Traces within the code: Yes! The origin of species. The beginning of everything. Who has never printed a message? It is a basic strategy, but it still works. We just need to add some metadata to each message and we will be able to track the application: every time we write something, we record the identifier of the thread that prints the message (so we know the author) and when the event occurred. This way we know who (thread), when (date and time) and what (detailed message) has happened.

    2. Specific monitoring tool: We have already seen that we cannot analyze a multi-threaded application’s context in real time; by the time we evaluate something, it has already changed. But what if the picture is taken from within our application? Let’s develop a tool that takes snapshots of the entire context for us, and then analyze the collected data. We need a probe! This is by far the best approach. Of course, we must develop specific code to achieve it, but once developed we will know the internal behavior of our application. In the end, it saves time and lets us properly understand any problem we detect. In our case, we have developed a tool to monitor what happens on the server: it collects the movements coming from each remote client and analyzes whether they are late, how their latency grows, their state, etc. We will see in future posts how this tool helped us find various ways to improve the performance of our application.
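The trace idea from point 1 can be sketched in a few lines. This is a minimal Python illustration (the helper names and message format are our own, not the actual game code): each trace line carries the who (thread name), when (timestamp with milliseconds) and what (the message).

```python
import datetime
import threading

def format_trace(message: str) -> str:
    """Build a trace line with who (thread), when (time) and what (message)."""
    now = datetime.datetime.now().strftime("%H:%M:%S.%f")[:-3]  # keep ms only
    who = threading.current_thread().name
    return f"{now} [{who}] {message}"

def worker(player_id: int) -> None:
    # Each worker tags its own traces with its thread name automatically.
    print(format_trace(f"movement received from player {player_id}"))

threads = [threading.Thread(target=worker, args=(i,), name=f"player-{i}")
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this metadata in every line, interleaved output from several threads can be separated and ordered afterwards.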

Here is an example of the output of our internal tool.

GAME 0
Frequency: 300   MaxLag:0  Offset:90
TURN 0
--------
Tick    StartedDate    Status Player 0     Mov Received Player 0 Advance Player 0              Inc advance Player 0  Next Sleep Player 0   Lag PLayer 0          Status Player 1     Mov Received Player 1 Advance Player 1              Inc advance Player 1  Next Sleep Player 1   Lag PLayer 1         
0       09:44:50.991   ON TIME             09:44:50.880          111                                                                                             ON TIME             09:44:50.896          95                                                                                              

1       09:44:51.308   ON TIME             09:44:51.210          98                            -13                   304.0                 0                     ON TIME             09:44:51.210          98                            3                     304.0                 0                    

2       09:44:51.626   ON TIME             09:44:51.527          99                            1                     304.5                 0                     ON TIME             09:44:51.528          98                            0                     304.0                 0                    

3       09:44:51.904   ON TIME             09:44:51.832          72                            -27                   291.0                 0                     ON TIME             09:44:51.832          72                            -26                   291.0                 0                    

4       09:44:52.223   ON TIME             09:44:52.124          99                            27                    304.5                 0                     ON TIME             09:44:52.102          121                           49                    315.5                 0                    
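Internally, a probe like this boils down to recording one row per tick and player. Here is a rough Python sketch of that idea (the class and field names are hypothetical; only the tick/advance/status logic mirrors the output above, where "advance" is how many milliseconds before the tick closed the movement arrived):

```python
import datetime

class ServerProbe:
    """Collect one row per (tick, player): arrival time, advance in ms
    (how early the movement arrived before the tick closed) and status."""

    def __init__(self, max_lag_ms: int = 0):
        self.max_lag_ms = max_lag_ms
        self.rows = []

    def record(self, tick: int, player: int, received_at, tick_closed_at) -> None:
        # Positive advance: the movement arrived before the tick closed.
        advance_ms = round((tick_closed_at - received_at).total_seconds() * 1000)
        status = "ON TIME" if advance_ms >= -self.max_lag_ms else "LATE"
        self.rows.append((tick, player, received_at, advance_ms, status))

# Values taken from the first row of the output above.
probe = ServerProbe()
received = datetime.datetime(2015, 1, 27, 9, 44, 50, 880000)
closed = datetime.datetime(2015, 1, 27, 9, 44, 50, 991000)
probe.record(0, 0, received, closed)  # advance = 111 ms, ON TIME
```

The collected rows can then be dumped as a table, like the one shown, once the game ends, so the analysis never disturbs the running clock.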

Monday, January 5, 2015

Mixing TCP and UDP to reduce latency in real-time connections.


Let’s start playing our best cards to deal with latency issues. Remember that previously (last entry) we chose to implement all dialogue between client and server on top of the TCP protocol. We also deactivated Nagle’s algorithm, among other strategies, because it caused our app to experience high latencies.
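Disabling Nagle’s algorithm is a one-line socket option. A minimal sketch in Python (the helper name and the host/port in the usage comment are placeholders of ours):

```python
import socket

def disable_nagle(sock: socket.socket) -> None:
    """Turn off Nagle's algorithm so small game commands are sent
    immediately instead of being coalesced by the kernel."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Usage on a client connection (host and port are placeholders):
# sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable_nagle(sock)
# sock.connect(("game.example.com", 7777))
```

The trade-off is more, smaller packets on the wire, which is exactly what a real-time game wants.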

      We have to take a step forward.

TCP’s packet delivery control adds extra traffic of unknown size between client and server, and we pay for it with gameplay. Every packet has to be acknowledged by the receiver and, in case of data loss, a new packet is sent with its corresponding acknowledgement. It’s easy to understand how this data interchange can impact our mission to be fast.

So, let’s think about what kind of traffic we are sending between client and server. Can we tolerate some packet loss?

Basically, analyzing almost every game, we have two kinds of commands:
1.     The commands that client and server exchange in order to configure the game (and prepare it before it starts) must be synchronized. We have to ensure that we respect the order sequence described in the interface contract between client and server (see last entry). For example, we can’t pay for a drink we have not yet ordered. At this stage, the game has not yet begun and a few extra milliseconds of delay are not critical. So we can, and should, keep using the TCP protocol to implement these commands. We can fully enjoy TCP’s packet delivery control, since a little delay does not matter at all.

2.     The commands inside a game need to be quick. It may seem obvious, but once inside a game every millisecond saved is worth a penny. We will be dealing with ticks of about half a second, because no one would play our game if the snake moved as fast as a truck. We have to squeeze our brains to make this part of the game as fast as possible, keeping in mind that we are playing across the Internet. This means different lag for each player, player latency that changes during the game, and players leaving the game once it has started.

Putting these premises together, it seems clear that we need to keep under control what we really send through the net. So avoiding extra packets becomes our main objective... and which protocol sends extra packets out of our control? TCP.
We need to avoid the acknowledgement traffic generated by TCP’s delivery control. By doing this we will reduce some data traffic, but we will also lose the guarantee that packets arrive on time and in the order they were sent from the other end.
In other words, we are going to use UDP wherever we can, and add some extra functionality to cover the ordering and delivery guarantees that we need during the game.
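That extra functionality can start with each datagram carrying its own tick number, so the receiver can reorder or discard stale movements by itself. A minimal sketch in Python (the wire format below is a hypothetical example of ours, not our actual protocol):

```python
import struct

# Hypothetical wire format: tick (uint32), player id (uint16), direction (uint8),
# packed big-endian. With the tick inside the datagram, the receiver can detect
# out-of-order or stale movements on its own -- the guarantee we gave up with TCP.
HEADER = struct.Struct("!IHB")

def encode_move(tick: int, player_id: int, direction: int) -> bytes:
    """Serialize one movement into a 7-byte UDP payload."""
    return HEADER.pack(tick, player_id, direction)

def decode_move(datagram: bytes) -> tuple:
    """Recover (tick, player_id, direction) from a received datagram."""
    return HEADER.unpack(datagram)
```

A receiver simply compares the decoded tick against its current tick and drops anything older.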

Focusing on the commands related to the game itself, there are two kinds of subcommands:

1. List of movements consolidated by the server:

Keep in mind that the server is authoritative and in charge of guaranteeing that every client has the same view of the game. So we decided to send, every time a tick ends, a message containing all the movements consolidated by the server. With this message, the remote client moves the non-own players and checks whether its anticipated movement can be confirmed. If we are lucky, everything is OK; but if the remote client’s movement does not arrive, or arrives at the server after the tick has ended, we may be in trouble. In fact, we will face remote/server inconsistencies if the remote client has changed its movement direction, because the server won’t know and will have sent back a movement with no direction change. This means the remote client will be forced to roll back to the last tick’s scenario and move again based on the server’s movements.

This means we have to implement some movement-buffer mechanism to allow the rollback described above. It also means that, for this message, we have no alternative but to keep using the TCP protocol. These messages have to arrive at the remote clients in order and every tick (we can allow a certain delay). Otherwise, we would have to implement our own packet delivery control, because the order of the list of movements matters, and especially because it would make the implementation of the rollback algorithm far more complex. Just think about what we would do if we couldn’t be sure that every tick we will receive a confirmation of our predictions.

Yes, you may think: let’s send the whole game board every tick. But that means more and more data to send, and that is one of the rules we can’t break. In fact, remember that we move the non-own players once the list of movements is read by the client, so the worst thing we can expect is some additional lag between our own player and the other remote players.
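The movement-buffer mechanism mentioned above can be sketched as a small ring of per-tick snapshots. This is an illustration of ours in Python, not the actual game code; the class name and depth are hypothetical:

```python
from collections import deque

class MovementHistory:
    """Keep the last few predicted positions so the client can roll back
    to the tick the server consolidated and replay from there."""

    def __init__(self, depth: int = 8):
        # Bounded buffer: old ticks are evicted automatically.
        self._snapshots = deque(maxlen=depth)  # entries: (tick, position)

    def record(self, tick: int, position: tuple) -> None:
        self._snapshots.append((tick, position))

    def rollback_to(self, tick: int):
        """Return the position stored for `tick`, or None if it is too old."""
        for t, pos in reversed(self._snapshots):
            if t == tick:
                return pos
        return None
```

When the server’s consolidated list contradicts a prediction, the client restores the snapshot for that tick and replays using the server’s movements.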

2. Movement command that the client sends to the server:

Perhaps the most important command that needs to be quick is the movement sent by the client to the server. It has to arrive before the server closes the turn tick. Every movement received later is discarded, and the player’s current direction is used to make the movement on the server. As described before, this produces an amendment on the late remote client once it receives the summarized message sent by the server. This discard scenario is pretty much the same as losing a client movement message, so we can use the same strategy.

The main idea is that a late client message can be treated as a lost client message. So we don’t care at all whether it is acknowledged by the server; we only want it to arrive early. The acknowledgement is implicit in the consolidated movements that the server sends once the tick ends. So, there we have it: we have built our own custom delivery control, which means we don’t need, and don’t want, the TCP protocol here.

Let’s use the UDP protocol for these commands. By doing this, client movements will arrive sooner, which means fewer movements will be rejected. This will allow us to reduce the server tick time and we will get a faster game! On the other hand, we will lose some packets, but we will reduce to the minimum the lag between client and server, which is our main goal.
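On the server side, the "late equals lost" rule boils down to something like this sketch (a Python illustration with hypothetical names): movements that arrived before the tick closed win; everyone else keeps their current direction, and the result is the consolidated message that acts as the implicit acknowledgement.

```python
def consolidate_tick(received: dict, current_directions: dict) -> dict:
    """Close a tick on the server: movements that arrived in time win;
    late or lost movements fall back to the player's current direction
    (this is the 'discard' that the late client will later amend)."""
    return {player: received.get(player, direction)
            for player, direction in current_directions.items()}
```

A movement that was sent but arrived after the tick closed simply never makes it into `received`, so it is handled exactly like a lost datagram.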


So, summarizing: we will finally use a mixed combination of protocols to keep our network dialogue coherent and deliver a fluid online game with little gap between server and client.