Recently we hit an issue where the client program got stuck. The client is written so that it sends requests to the server synchronously: it sends the next request only after receiving the acknowledgement for the current one. The issue happens intermittently, which suggests a race condition.
On the server side, there are two threads.
Thread 1:
- submits a request to one of the server's internal submission queues.
- increments the io_submitted value.
Thread 2:
- picks up the item from the submission queue and does an asynchronous IO using libaio.
- the libaio thread calls the call-back function passed to it once the IO is done.
- as part of the call-back function, the aio thread enqueues the request into the completion queue.
- picks up the item from the completion queue and increments the io_completed value.
- then checks io_submitted == io_completed before doing the next set of tasks.
- after completing the next set of tasks, sends a response to the client.
The problem is that the client is not receiving the acknowledgment. Why?
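There is a race: thread 2 can increment io_completed and run the comparison before thread 1 has incremented io_submitted. This happens when thread 1 is scheduled out after enqueuing the request but before incrementing io_submitted. At that moment io_completed is one ahead of io_submitted, so the check fails and no response is sent. Because the client is still waiting for the acknowledgement, no further request arrives to trigger the check again, and the client stays stuck.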
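To make the ordering concrete, here is a minimal single-threaded sketch in C that replays the two interleavings step by step instead of relying on real threads. Only io_submitted and io_completed come from the description above; the struct, the queue counters, and the function names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the server state. Queues are reduced to counters. */
struct server_state {
    int io_submitted;      /* incremented by thread 1 */
    int io_completed;      /* incremented by thread 2 */
    int submission_queue;  /* requests waiting for I/O */
    int completion_queue;  /* finished I/Os waiting for thread 2 */
    bool response_sent;    /* set when the equality check passes */
};

/* Thread 1, step 1: enqueue the request. */
void t1_enqueue(struct server_state *s) { s->submission_queue++; }

/* Thread 1, step 2: account for the request. */
void t1_count(struct server_state *s) { s->io_submitted++; }

/* Thread 2: dequeue and issue the I/O. The second line stands in for
 * the completion call-back putting the request on the completion queue. */
void t2_do_io(struct server_state *s) {
    s->submission_queue--;
    s->completion_queue++;
}

/* Thread 2: consume the completion, count it, and run the check once. */
void t2_complete(struct server_state *s) {
    s->completion_queue--;
    s->io_completed++;
    if (s->io_submitted == s->io_completed)
        s->response_sent = true;
}

/* Thread 1 is preempted between its two steps. */
bool replay_racy_order(void) {
    struct server_state s = {0};
    t1_enqueue(&s);
    t2_do_io(&s);
    t2_complete(&s);  /* sees io_submitted == 0, io_completed == 1 */
    t1_count(&s);     /* counters now match, but the check is not re-run */
    return s.response_sent;
}

/* Thread 1 finishes both steps before thread 2 runs. */
bool replay_expected_order(void) {
    struct server_state s = {0};
    t1_enqueue(&s);
    t1_count(&s);
    t2_do_io(&s);
    t2_complete(&s);  /* sees 1 == 1 */
    return s.response_sent;
}
```

In this model replay_racy_order() leaves response_sent false while replay_expected_order() sets it, which matches the intermittent hang: the outcome depends only on where thread 1 gets scheduled out.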
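A couple of solutions:
- Move the io_submitted increment before submitting the request to the internal submission queue.
- Use a spin lock to protect io_submitted: thread 1 holds it across the enqueue and the increment, and thread 2 takes it before doing the comparison.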
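The first option can be sketched with a small single-threaded model, again in C. This is an illustration under assumptions, not the actual server code: the struct and function names are made up, the queues are collapsed into a single counter, and a real multithreaded version would also need the counters to be atomic or lock-protected.

```c
#include <stdbool.h>

/* Hypothetical model: one counter stands in for the queues. */
struct counters {
    int io_submitted;
    int io_completed;
    int queued;          /* requests visible to thread 2 */
    bool response_sent;
};

/* Thread 1 with the fix: count the request first, then publish it. */
void t1_submit_fixed(struct counters *c) {
    c->io_submitted++;   /* accounted for before thread 2 can see it */
    c->queued++;         /* only now does the request become visible */
}

/* Thread 2: process one request if any is visible.
 * Returns false when there was nothing to do. */
bool t2_process(struct counters *c) {
    if (c->queued == 0)
        return false;
    c->queued--;
    c->io_completed++;
    if (c->io_submitted == c->io_completed)
        c->response_sent = true;
    return true;
}

/* Even if thread 2 runs first, it finds nothing queued; once the request
 * is visible, io_submitted already includes it, so the check passes. */
bool fixed_order_sends_response(void) {
    struct counters c = {0};
    bool early = t2_process(&c);  /* nothing queued yet */
    t1_submit_fixed(&c);
    bool late = t2_process(&c);   /* sees 1 == 1 */
    return !early && late && c.response_sent;
}
```

The point of the reordering is that thread 2 can only act on a request after it has been enqueued, and by then the increment has already happened, so there is no window in which io_completed runs ahead of io_submitted.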