Introduction to NoSQL and Polyglot Persistence - CodeProject
Today we are facing the rapid expansion of data-driven businesses, especially web-based businesses that transfer an enormous amount of data every second. According to an often-quoted estimate, almost 90% of all data has been created in the last two years. That data has to be stored somewhere, and usually that somewhere is a database. Since every business is different, we have different needs when it comes to the ways we want to store the data. There are two routes we can take depending on our needs:
- Relational Databases
- NoSQL
Further on, I will try to cover the strengths and weaknesses of both, but I’ll go deeper into NoSQL databases and why we need to consider them as a solution to our problems.
Relational databases
Relational databases have been ruling the world since the mid-1980s, although the relational model itself was first proposed by Edgar F. Codd back in 1970. Even though concepts like object databases threatened to take over the reign somewhere in the 90s, that never happened. There are many reasons why people love relational databases and why they have been so dominant for such a long time.
Firstly, they are very close to the way we see the world. It comes very naturally to us to create relations between pieces of data and connect them to a single entity. Relational databases also have strong consistency, meaning that once a change is made to some of the records, it can be seen by every application in that domain. This made them an ideal integration mechanism, with multiple applications sharing a common database and common data.
Apart from that, as you probably already know, they have features like transactions and joins. Transactions give us a way to apply changes to multiple tables as one unit: if one instruction fails, the previous changes are rolled back. This is a fairly complex but very powerful feature used in data-sensitive systems. Also, SQL has pretty much become the standard language for a wide range of relational databases, and since it is very expressive and easy to learn, it contributed enormously to their adoption. So, to sum it up, consistency, integration, expressiveness, and stability are the reasons we can find relational databases in almost every business solution today.
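To make the rollback behaviour a bit more concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper and the amounts are made up for illustration; the point is only that a failed second statement also undoes the first one.

```python
import sqlite3

# Hypothetical schema: the CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit; return True on success."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


ok = transfer(conn, 1, 2, 30)       # both updates are applied
failed = transfer(conn, 1, 2, 500)  # the debit fails, the credit is rolled back
```

After both calls the balances reflect only the first transfer, even though the second one had already credited the destination account before it failed.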
Problems of Relational Databases
The world has changed in the last few years. As already mentioned, it is a world where a huge amount of data is stored, transferred and manipulated, and that data is mostly unstructured. This doesn’t necessarily mean that the data lacks structure, but that the structure is hard to grasp and define, especially in tabular form. Plus, we no longer have that many standalone applications, which is what relational databases were made for. Instead, these applications are being replaced by distributed systems like social media, the internet of things, microservices and such. This kind of world, with distributed systems and a lot of data, gave NoSQL databases space to blossom.
Probably the biggest problem that relational databases have today is the impedance mismatch. Impedance mismatch is a term used to describe the difference between the relational model and in-memory data structures in general. For example, objects in application memory are usually organized as a graph of nested and linked objects, which is different from the tabular way that data is saved in relational databases. That is why we always need to modify, cut and re-arrange our data before we save it to a relational database.
Why is this such a big problem? Well, this is why it takes a long time to extend and maintain these databases: changes to the database are coupled with changes in the application itself. And today, applications and systems are expected to be available 24/7. There is no longer time to turn these systems off for the weekend and update the database. The ability to manage data dynamically, to go to production fast, and to develop software using agile methodologies all add up to a need for more flexibility than relational databases provide.
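The "modify, cut and re-arrange" step can be illustrated with a small sketch. The `Order` and `LineItem` classes and the two-table layout are hypothetical; they only show how one nested in-memory object has to be split into flat rows and stitched back together again.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineItem:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: List[LineItem] = field(default_factory=list)


def to_rows(order):
    """Split one object graph into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)
    item_rows = [
        (order.order_id, item.product, item.quantity) for item in order.items
    ]
    return order_row, item_rows


def from_rows(order_row, item_rows):
    """Rebuild the object from flat rows, the work a join plus a mapper does."""
    order_id, customer = order_row
    items = [
        LineItem(product, quantity)
        for (row_order_id, product, quantity) in item_rows
        if row_order_id == order_id
    ]
    return Order(order_id, customer, items)


order = Order(1, "Ana", [LineItem("keyboard", 1), LineItem("cable", 2)])
order_row, item_rows = to_rows(order)
```

Every new nested field on `Order` means touching both mapping functions and the table layout, which is the coupling described above.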
NoSQL
The term NoSQL originated as a Twitter hashtag for a meet-up back in 2009. It is sometimes read as "Not Only SQL" and sometimes as "Non SQL". The term is pretty loose and is used to cover a wide range of databases. These databases tried to tackle the problems that relational databases had with flexibility, scalability, and performance. In order to do so, however, they often sacrificed some of the good things that relational databases provided, such as an expressive query language, secondary indexes, transactional mechanisms and strong consistency. Each of them made these trade-offs in its own way, which is why these databases are so different from each other.
We could say that NoSQL databases have these common characteristics:
- They don’t use SQL – They, however, have their own querying languages, and often they make it similar to SQL since SQL is easy to learn. For example, Cassandra’s querying language is even called CQL.
- They are not relational databases
- Most of them are cluster-friendly – This was the initial idea – to store databases on multiple machines. Not all of them fit this description, though; graph-oriented databases, for example, were designed around a different problem.
- They don’t have a schema – in these databases, it is possible to add a field to a “record” without first making changes to the structure itself. With a schema, you have to know in advance what you want to store, which can be hard.
All the above are common characteristics, but certainly, by no means are they the definition of NoSQL. And at this point, I don’t think we’ll ever have a full, proper definition of NoSQL databases. This is probably good, since it goes hand in hand with NoSQL’s “free spirit”, so to say.
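The schema-less characteristic from the list above can be sketched by comparing two tiny stand-ins: SQLite (through Python's `sqlite3` module) as the relational side, and a plain list of dictionaries as a schema-less "collection". The `people` table and the `nickname` field are invented for this example.

```python
import sqlite3

# Relational side: the structure has to exist before the data does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")


def insert_with_nickname(conn):
    """Try to store a field the table may not know about yet."""
    try:
        conn.execute(
            "INSERT INTO people (id, name, nickname) VALUES (1, 'Ana', 'Annie')"
        )
        return "inserted"
    except sqlite3.OperationalError:
        # SQLite reports the unknown column as an OperationalError.
        return "rejected"


first_try = insert_with_nickname(conn)   # the column does not exist yet
conn.execute("ALTER TABLE people ADD COLUMN nickname TEXT")
second_try = insert_with_nickname(conn)  # works after the schema change

# Schema-less side: a new field simply shows up in the record that needs it.
people = []
people.append({"id": 1, "name": "Ana"})
people.append({"id": 2, "name": "Bo", "nickname": "B"})
```

The flexibility is not free: nothing stops a typo like `nicknmae` from being stored, so the "schema" effectively moves into the application code that reads the data.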
Types of NoSQL
As mentioned before, NoSQL databases made a shift away from relational databases. But what does this mean? It means they no longer use the relational data model. A data model is the model through which we perceive the data in a database. The relational data model can be visualized as a set of tables in which each row represents a different record, a different entity. NoSQL databases use a different approach. Based on the data model, there are a few types of databases in the NoSQL world:
- Key-Value Stores
- Column Stores
- Graph Stores
- Document Stores
- Multi-Model Databases
A Key-Value Store is effectively an associative array stored on a disk. It offers a single key lookup, like a dictionary, so to say. The good thing about these databases is that they are very fast for reading by key. They are not so good for reverse lookups (finding keys by value) or for additional analytics. An example of this type of database is Redis.
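As a rough illustration of why key lookups are cheap and reverse lookups are not, here is a sketch where a plain Python dictionary stands in for the store. The `put`, `get` and `keys_for_value` helpers are invented for this example and are not the API of Redis or any other product.

```python
store = {}  # stand-in for the key-value store


def put(key, value):
    """Store a value under a key, overwriting any previous value."""
    store[key] = value


def get(key):
    """Single key lookup: one hash lookup, regardless of store size."""
    return store.get(key)


def keys_for_value(value):
    """Reverse lookup: has to scan every entry in the store."""
    return sorted(key for key, stored in store.items() if stored == value)


put("session:1", "ana")
put("session:2", "bo")
put("session:3", "ana")
```

`get("session:2")` goes straight to the entry, while `keys_for_value("ana")` has to walk through the whole store to find the two matching sessions.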
Column Stores are the subset of NoSQL databases that somewhat kept the tabular form. What does this mean? Well, as you probably know, relational databases keep all their data in tabular form, where every row represents one entity. Every row is saved separately on the disk, so we could say that data is aligned by rows. Reading from this kind of database always reads the whole row, even when not all of that data is needed (let’s say we want the values of just one column).
Column stores, on the other hand, pivot this approach a little bit. They store data in so-called column families, in column order. For example, the Ids of all records are saved first, then all of their names, and so on. Why is this a big deal? This way it is possible to get a whole column much more efficiently than by getting all rows and pulling a specific value from each of them. We can basically get more information from the database in a single seek. Also, since the values within one column tend to be similar, this layout compresses well. The price is paid on writes: in a strictly columnar layout, adding a single record means touching every column. The example usually given for this category is Cassandra, although as a wide-column store it groups the columns of each row under a key and is designed for fast writes.
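The difference between the two layouts can be sketched in memory. The records and field names below are invented, and real column stores work with disk blocks and compression rather than Python lists, so treat this only as a picture of the idea.

```python
records = [
    {"id": 1, "name": "Ana", "age": 30},
    {"id": 2, "name": "Bo", "age": 25},
    {"id": 3, "name": "Cy", "age": 41},
]

# Row-oriented layout: the values of one record sit next to each other.
row_layout = [(r["id"], r["name"], r["age"]) for r in records]

# Column-oriented layout: the values of one column sit next to each other.
column_layout = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age": [r["age"] for r in records],
}


def average_age_rows(rows):
    """Walks over every full row just to read its third field."""
    return sum(row[2] for row in rows) / len(rows)


def average_age_columns(columns):
    """Reads one contiguous column and nothing else."""
    ages = columns["age"]
    return sum(ages) / len(ages)


def append_record(columns, record):
    """A single new record has to be written into every column."""
    for name, values in columns.items():
        values.append(record[name])
```

Both functions return the same average, but the column version never looks at ids or names, while `append_record` shows where the extra write cost comes from.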
Graph stores use graph structures with nodes, edges, and properties to represent, store and query data. They are used for storing a network of connections or relationships, e.g. social networks. Graph stores are a bit different from other NoSQL databases since they originated from a different problem with relational databases: handling a large number of small records with a lot of relationships between them. An example of these databases is AllegroGraph.
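To get a feeling for the kind of question graph stores are built for, here is a small in-memory sketch. The friendship data and both helper functions are made up, and a real graph database would expose this through its own query language rather than hand-written traversal code.

```python
from collections import deque

graph = {}  # person -> set of directly connected people


def add_friendship(graph, a, b):
    """Add an undirected edge between two people."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the number of edges on the shortest
    path between two people, or None if they are not connected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, distance + 1))
    return None


for a, b in [("ana", "bo"), ("bo", "cy"), ("cy", "di"), ("eve", "fay")]:
    add_friendship(graph, a, b)
```

In a relational database the same question would typically need one self-join per hop, which is why relationship-heavy data pushed people towards a different model.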
One of the most popular types of NoSQL databases is the document store. Document stores revolve around the concept of a document. Documents are self-describing structures that are usually similar to each other but don’t have to be the same. Unlike the rows in relational databases, where every row has to follow the same schema, documents can vary from each other and still belong to the same collection. MongoDB is an example of a document store.
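A collection of documents that "vary from each other" might look something like the sketch below. The product documents and the `find` helper are hypothetical and only mimic the idea of querying by field; they are not the API of MongoDB or any other document store.

```python
# One collection, three documents, three different shapes.
products = [
    {"id": 1, "type": "book", "title": "Sample Novel", "pages": 120},
    {"id": 2, "type": "laptop", "title": "Sample Laptop",
     "ram_gb": 16, "ports": ["usb-c", "hdmi"]},
    {"id": 3, "type": "book", "title": "Untitled Draft"},
]

_MISSING = object()  # sentinel, so a missing field never equals a real value


def find(collection, **criteria):
    """Return the documents whose fields equal all given criteria.
    Documents that lack a requested field simply do not match."""
    return [
        doc
        for doc in collection
        if all(doc.get(key, _MISSING) == value for key, value in criteria.items())
    ]


books = find(products, type="book")
well_equipped = find(products, ram_gb=16)
```

Note that the third document has no `pages` field at all, and nothing had to be migrated for the laptop to bring its own `ram_gb` and `ports` fields.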
Multi-model databases are designed to handle multiple data models against a single integrated backend. They are a fairly new thing in the NoSQL world, and there will probably be much more buzz around this type of database in the future.
Polyglot Persistence
Basically, there are two main reasons why engineers choose NoSQL databases for their problems:
- Minimizing the impedance mismatch – This effectively entails an increase in developers’ productivity. A lot of effort is spent on mapping data between in-memory data structures and a relational database. Sometimes a NoSQL database has a data model that better fits the needs of our application, thus simplifying the interaction of application code with the database. This way we have less code to develop and maintain. For example, in the MEAN stack (M is for MongoDB), the whole stack works with JSON objects, so very little mapping is needed between application code and the database.
- Embracing large scale data – today it is expensive to store a large amount of data in relational databases. Businesses today need to capture and process a lot of data more quickly. Because many NoSQL databases are designed to run on clusters, they are a better fit for this kind of problem. Large scale clusters give us the possibility to store larger data sets and to process large amounts of analytic data. Also, NoSQL databases have different data models that may be better suited for processing that huge amount of data.
Does this mean that relational databases are dead? No, not at all. The relational data model is still the best choice for a great number of problems out there. Apart from that, relational databases have also been here for decades, meaning there are a bunch of tools for them and people are familiar with them, in comparison to the fairly new concept of NoSQL databases.
The only difference is in the way that we should perceive relational databases. They are no longer the only option for data storage. Now we need to understand the nature of the data and use different data stores in different situations. Martin Fowler calls this point of view Polyglot Persistence; it is described in the book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence[1]. This way of looking at data storage leads us to solutions that have multiple databases, with each database used for a different purpose. For example, we can use a SQL database for the financial part of the application, but use MongoDB for the product catalog and Cassandra for large scale analytics.
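The example above can be sketched as a single application object that sends each kind of data to a different store. Everything here is a stand-in chosen so the sketch runs on its own: SQLite plays the relational database, a dictionary plays the document store and a list plays the analytics store. The `PolyglotShop` class and its method names are hypothetical.

```python
import sqlite3


class PolyglotShop:
    """Routes each kind of data to the store that suits it best."""

    def __init__(self):
        # Financial data: relational and transactional.
        self.ledger = sqlite3.connect(":memory:")
        self.ledger.execute(
            "CREATE TABLE payments ("
            "order_id INTEGER PRIMARY KEY, "
            "amount_cents INTEGER NOT NULL)"
        )
        # Product catalog: free-form documents keyed by product id.
        self.catalog = {}
        # Analytics: append-only stream of events.
        self.events = []

    def add_product(self, product_id, document):
        self.catalog[product_id] = document

    def record_payment(self, order_id, amount_cents):
        with self.ledger:  # commit on success, roll back on error
            self.ledger.execute(
                "INSERT INTO payments VALUES (?, ?)", (order_id, amount_cents)
            )

    def track(self, event, **details):
        self.events.append({"event": event, **details})

    def total_revenue_cents(self):
        (total,) = self.ledger.execute(
            "SELECT COALESCE(SUM(amount_cents), 0) FROM payments"
        ).fetchone()
        return total
```

What the sketch does not solve is keeping the three stores in agreement with each other: a payment can be committed while the matching analytics event is lost, and no single transaction spans all of them.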
Conclusion
Polyglot Persistence has opened a new door in application development, since never before have we had so many options and possibilities. It has also raised a lot of questions. One of them is which database should be used in which situation. This approach creates a lot of complexity too, since there is no simple mechanism for maintaining data consistency across multiple databases. Maybe multi-model databases will fill this gap. Either way, the database world has not been this exciting since the mid-80s. What a time to be alive!
Read more posts from the author at Rubik’s Code[2].
This work is licensed under a Creative Commons Attribution 4.0 International License[3].
References
- [1] NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (www.amazon.com)
- [2] Rubik’s Code (rubikscode.net)
- [3] Creative Commons Attribution 4.0 International License (creativecommons.org)
- [4] CodeProject (www.codeproject.com)