Race condition Scenario 2

Case1:

I was implementing a file watch service using apache common io framework.  Using this framework we can monitor a particular directory (in my case).  Any file created or deleted in this directory will be notified to the callback registered. The call-back is invoked in a background thread that is started when we start monitoring.  We need to provide a periodic check interval to look for the changes. I think it takes the snapshot of current and compares with the previous to detect the changes.

During the initialization of my server, I need to check if any files are there in those directories.  I have to treat them as newly created.  So I had code like

  1. Check for exiting files.
  2.  Register for any file changes.

The problem with the above order is that between 1 and 2 there is a chance that some files may get created. This is a race condition. To fix this issue, we have to flip the order. It is fine to process the same file multiple times. (Duplicates are possible with the reverse order)

 

Case2:

During the discussion of a design of workflow manager,  one of the problems mentioned was: there is a completion queue and there is a submission queue. The work item is submitted by the subscriber,  we should be able to avoid duplicate work submissions from the subscribers.  To implement this we have decided to keep the work item in the submission queue until completed.  Once completed we move it to the completion queue.  when moving the item from submission queue to completion queue we should do that atomically ( we should not see the item in both the queues or we should not see the item in none of the queues).  We can use transactions to do this move.  The problem is that the submitter/subscriber checks these queues before submitting the item to the submission queue (it is neither in submission nor in completion queues).  Looking at both the queues cannot be done using transactions.  Because of this, there are races possible which can lead to the submission of duplicate items to the queue.  One solution I proposed is

  1. Look into the submission Queue
  2. Look into the Completion Queue
  3. Look into the submission Queue again to confirm there are no changes done between 1 and 3.

 

 

 

Advertisements

Transaction Isolation levels in Database

Transaction in the database can happen in isolation without affecting other transaction. Database with strict mutual exclusion is not performance efficient. To avoid performance penalty,  Databases provide some flexibility to allow transactions to proceed with some level of understanding by applications. All the transactions that are happening on a database by multiple applications (or) threads in an application can undergo some transaction (concurrency) anomalies when there is no strict mutual exclusion.   The common anomalies we hear are dirty reads, non-repeatable reads, phantom reads, lost updates.

Dirty reads: –

Thread2 –  Transaction2: writes data (not committed to Database)

Thread1 – Trasaction1: read the data (It can see the uncommitted write by Thread2)

Problem: Transaction2 can roll back the data, but Trasaction1 uses the data.

Non-Repeatable Reads: –

Thread1 – Transaction1: reads data

Thread2 – Transaction2: writes the data (committed to Database)

Thread1 – Transaction1; reads data (can see the committed data)

Phantom reads:-

Thread1 –  Transaction1: reads data

Thread2 – Transaction2 Inserts the new row into the table

Thread1 – Transaction1: reads data ( can see the newly inserted data)

Databases provided different Isolation levels (visibility levels) to prevent the above-mentioned anomalies and also to a trade-off between performance and anomalies.

TRANSACTION_READ_UNCOMMITED (allows all anomalies)

TRANSACTION_READ_COMMITED (prevents dirty read)

TRANSACTION_REPEATABLE_READ (prevents dirty read, repeatable reads)

TRANSACTION _SERIALIZABLE (prevents dirty read, repeatable reads, phantom reads)

There is another isolation level, Snapshot Isolation, which prevents anomalies like serializable isolation (implemented using locks, but performance cost),  all reads work on a committed database snapshot copy (implemented using versioning),  does not take locks, so performance efficient.

In case of updates, a conflict between transactions will result in an aborted transaction if the isolation level is snapshot isolation/repeatable read.

Updates can be silently lost in case of READ_UNCOMMITED/READ_COMMITED isolation levels.

Database Design concepts

I was looking at someone’s DB schema diagrams and could not understand few notations and keywords. So I read through these concepts quickly and also watched some youtube videos.  Found https://www.studytonight.com/dbms has nice videos on normalization of DB tables.  Lucid chart website (https://www.lucidchart.com/pages/ER-diagram-symbols-and-meaning) gave a crisp explanation of these notations. The lucid chart is using Crow’s foot notation.

Some jargon I heard while reading more about DB design.

Primary key: Non-null, unique and not changing field in the DB table.
Foreign key:  it is referencing the primary key of some other table.
Normalization: a technique used in reducing the footprint of the redundant data and removing the insert/delete/update anomalies in the table design.

  • 1NF

    • Rule 1: Single Valued Attributes
    • Rule 2: Attribute Domain should not change (the type of value should match with schema)
    • Rule 3: Unique name for Attributes/Columns
    • Rule 4: Order doesn’t matter
  •  2NF
    • There should be no Partial Dependency. Dependency is all the other columns in a table depends and has some relation to the primary key. Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key(in case primary key is a composite key). (teach name has a partial dependency on subject_id with student_id + subject_id primary key)
  •  3NF
    • Table should not have Transitive Dependency. Partial DependencyTransitive Dependency. When a non-prime attribute depends on other non-prime attributes rather than depending upon the prime attributes or primary key.
  • BCNF
    • For dependency A->B,  A should be super key. In simple words, it means, that for a dependency A → B, A cannot be a non-prime attribute, if B is a prime attribute.

Related SQL, I got clarification for a few keywords :

  • WHERE vs HAVING,  to specify the conditions with aggregated values in a SQL query where GROUPBY is used.
  • GROUPBY is used to do apply the aggregate functions like count, sum, avg on the selected columns.
  • ORDERBY is used to arrange the output in a specified column in ascending or descending order.
  • INDEX, if you create an index on a specific column, the querying on that field will be faster. otherwise, the values are stored in such a way the retrieval is faster with the primary key.
  •  CROSS JOIN, get all the combinations of two tables
  • INNER JOIN, mutually matching entries. (NATURAL JOIN is a verity of it.)
  • LEFT OUTER JOIN, matched data from the two tables and then the remaining rows of the left table and null from the right table’s columns.
  • RIGHT OUTER JOIN, matched data from the two tables and then the remaining rows of the right table and null from the left table’s columns.
  • FULL OUTER JOIN matched data from the two tables and then the remaining rows of both tables with null from the right/left table’s columns.
  • In the case of a Foreign key, if the primary key is deleted in the main table then we can say that delete all matching rows in the referenced table (Cascade delete). or fill with NULL values for those rows.

 

 

 

Component View of Webapp design

 

webapp componets

 

I was thinking about how web applications are designed and what are the different components and languages involved.

Java Script is most popularly used these days for small and mid size web applications and python for enterprise applications.

Angular.js is the single page web app design framework which runs the code inside browser.

Node.js is the backend framework which does database queries and simple data fetching tasks. The code is written in java script.

Python web application framework like flask, django are supported by uWSGI standards, and these are plugged in as modules for uWSGI servers.  All time-consuming tasks are queued using celery task manager framework and these are processed by a set of backend worker threads.

Nginx is used as a routing server for all the different traffic from users on a single IP and port combination. It routes the request to either node.js server or uWSGI server based on the request types.