MySQL v/s PostgreSQL – Relational Database Comparison

What is a Relational Database?

A relational database is a digital database whose organization is based on the relational model of data, as proposed by E. F. Codd in 1970. The various software systems used to maintain relational databases are known as Relational Database Management Systems (RDBMSs).

In this article we will compare two well-known open-source relational database engines:

  1. MySQL
  2. PostgreSQL


MySQL, the most popular open-source SQL database management system, is developed, distributed, and supported by Oracle Corporation. With its proven performance, reliability and ease-of-use, MySQL has become the leading database choice for web-based applications, used by high profile web properties including Facebook, Twitter, YouTube, Yahoo! and many more.

Offshoots of MySQL (also known as forks of MySQL) include:

  1. MariaDB
  2. Percona Server
  3. Drizzle (abandoned)
  4. WebScaleSQL (abandoned)


PostgreSQL is a powerful, open-source object-relational database system, often described as the most advanced open-source relational database engine. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness. PostgreSQL is currently used by technology giants such as Apple, Red Hat, and Juniper Networks.

Databases like IBM Netezza and Amazon Redshift are forks of PostgreSQL. A complete list of PostgreSQL-derived forks and rebranded distributions is available on the PostgreSQL derived databases page.

MySQL v/s PostgreSQL

In the Stack Overflow Developer Survey 2017, MySQL was, as expected, the most commonly used database engine. But the results also showed that PostgreSQL is far more loved and wanted by the developer community than MySQL.

Here's a summary of the most commonly evaluated features and abilities of MySQL and PostgreSQL:


ACID Compliance
  MySQL: ACID compliant only when using the InnoDB or NDB Cluster storage engines.
  PostgreSQL: completely ACID compliant.

SQL Compliance
  MySQL: only partially compliant in some versions (e.g. does not support CHECK constraints); refer to the documentation for more details.
  PostgreSQL: largely SQL compliant; refer to the documentation for more details.

Concurrency Support (MVCC)
  MySQL: MVCC support only in InnoDB.
  PostgreSQL: a very efficient MVCC implementation that achieves very high levels of concurrency.

Geo-spatial Query Support
  MySQL: available with the OpenGIS extension.
  PostgreSQL: available with the PostGIS extension; PostgreSQL's geo-spatial support is far better than MySQL's.

NoSQL Features/JSON Support
  MySQL: has a JSON data type but no other NoSQL features, and does not support indexing JSON.
  PostgreSQL: supports JSON and other NoSQL features such as native XML support and key-value pairs with HSTORE, and supports indexing JSON data for faster access.

Replication Methods
  MySQL: master-master and master-slave replication.
  PostgreSQL: master-slave replication.

Security
  MySQL: implements security based on Access Control Lists (ACLs) for all connections, queries, and other operations that a user may attempt to perform; there is also some support for SSL-encrypted connections between MySQL clients and servers.
  PostgreSQL: has ROLES and inherited roles to set and maintain permissions, native SSL support for encrypting client/server communications, and Row Level Security.

Supported Operating Systems
  Both run on all major platforms, including Linux, Windows, macOS, Solaris, and FreeBSD.

License
  MySQL: open-source, GPL version 2.
  PostgreSQL: open-source, BSD-style (the PostgreSQL License).

DB-Engines Ranking
  MySQL: #2
  PostgreSQL: #4

Performance Comparison

Performance is an area that can only be gauged by evaluating the metrics of potential scenarios, as it depends on the requirements of the specific user and the nature of the application.


  • PostgreSQL is widely used in large systems where read and write speeds are crucial and data needs to be validated. In addition, it supports a variety of performance optimizations that are otherwise available only in commercial solutions such as Oracle and SQL Server, e.g. geospatial data support and concurrency without read locks.
  • Overall, PostgreSQL performance is utilized best in systems requiring execution of complex queries.
  • PostgreSQL performs well in OLTP/OLAP systems when read/write speeds are required and extensive data analysis is needed.
  • PostgreSQL also works well with Business Intelligence applications but is better suited for Data Warehousing and data analysis applications that require fast read/write speeds.


  • MySQL is widely chosen for web-based projects that simply need a plug-and-play database to store and read/write data.
  • MySQL performs well when a quick setup is needed, such as in plug-and-play systems. It is common for MySQL to underperform when strained by heavy loads or when attempting to complete complex queries.
  • MySQL performs well in OLAP/OLTP systems when only read speeds are required.
  • MySQL + InnoDB provides very good read/write speeds for OLTP scenarios. Overall, MySQL performs well with high concurrency scenarios.
  • MySQL is reliable and works well with Business Intelligence applications, as business intelligence applications are typically read-heavy.

Below is the performance difference between MySQL, MariaDB, and PostgreSQL (with default configuration) for different types of queries:

Performance comparison with time taken in millisecond:

Performance comparison with time taken in natural log(millisecond):

This information is based on the benchmarking performed in Ng Heng Lim’s blog.


Public Cloud War: AWS v/s Google Cloud Platform v/s Windows Azure

In the public cloud arena there are three dominant players: Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Each offers slightly different services from the others, but they also overlap in many areas and compete directly on price.

The strengths and weaknesses of each can be key factors for you as a business looking to go down the cloud path. All three of them can be tried out on a small scale, through a proof of concept, so that you can compare the services and see which suits your needs the best.

However, because of their size, it can often be difficult to effectively compare what services are better suited to your requirements. Let’s compare these three big players by service category:

  1. Compute
  2. Storage
  3. Networking
  4. Pricing structure

But before reaching there, let’s compare the services offered by these cloud giants.


Amazon Web Services

The quantity and quality of the services available on AWS is extremely broad, and together they open up a huge set of opportunities for many different needs. There are specific tools for media transcoding and streaming, a managed Directory Service, four different relational and NoSQL databases, and even a Desktops in the Cloud service that provides remote Windows desktops. The various services are really well integrated, and they provide a very comprehensive cloud offering. If you don't mind locking yourself into a single vendor, AWS has no rivals with regard to the completeness of its platform and the productivity level you can reach thanks to it.

Google Cloud Platform

On the other side, Google Cloud Platform's list of products is much smaller, and mostly focused on classic IaaS and PaaS services. The latter is probably the area where Google has focused most of its efforts, given that Google App Engine was the first service ever launched in GCP. Nevertheless, you can find the usual IaaS computing, object storage, relational and non-relational databases, and a few more services for DNS and endpoints. The difference between the two competitors is huge here. How much it impacts your architecture mostly depends on what your needs are. Chances are that the services provided by GCP suffice for most common needs, and as we will see very soon, Google Compute Engine has many strengths that AWS lacks.

Windows Azure

Microsoft introduced its compute service as a preview in 2012, but didn't make it generally available until May 2013. Azure users choose a VHD (Virtual Hard Disk), which is equivalent to Amazon's AMI, to create a VM. A VHD can be predefined by Microsoft or by third parties, or be user-defined. With each VM, you specify the number of cores and the amount of memory. Azure isn't anywhere near the size of AWS, but Microsoft has been working hard to match Amazon's services and flexibility. For example, in early 2014 Azure began offering data storage redundancy across data center regions, a feature dubbed Zone Redundant Storage (ZRS). It also introduced a new 'Basic' service level that removes auto-scaling and load balancing for clients who don't require such services, such as those running test environments and batch processing applications. For Windows-centric development or hosting, Azure offers slightly better options: Visual Studio, .NET, and Windows programming languages such as VB and Visual C++ are all fully supported and well integrated.

That was a brief overview of these giant cloud players. Now it is time to focus on comparing the services they offer.

Comparison: Compute

AWS's EC2 (Elastic Compute Cloud) provides Amazon's core compute service, allowing users to configure virtual machines using either pre-configured or custom AMIs (machine images). You select the size, power, memory capacity, and number of VMs, and choose from among different regions and availability zones within which to launch. EC2 also offers load balancing (ELB) and auto scaling: ELB distributes loads across instances for better performance, and auto scaling allows users to automatically scale available EC2 capacity up or down.

In 2012, Google introduced its cloud computing service: Google Compute Engine (GCE). GCE lets users launch virtual machines, much like AWS, into regions and availability groups; however, it didn't become generally available until 2013. Since then, Google has added its own enhancements, like load balancing, extended operating system support, live migration of VMs, faster persistent disks, and instances with more cores.

Also in 2012, Microsoft previewed its compute service, making it generally available in May 2013. As described above, Azure users create a VM from a VHD, predefined by Microsoft or third parties or user-defined, and specify the number of cores and the amount of memory for each VM.

Comparison: Storage and databases

AWS provides ephemeral (temporary) storage that is allocated once an instance is started and is destroyed when the instance is terminated. It provides Block Storage that is equivalent to hard disks, in that it can either be attached to any instance or kept separate. AWS also offers object storage with their S3 Service, and archiving services with Glacier. AWS fully supports relational and NoSQL databases and Big Data.

Google's Cloud Platform similarly provides both temporary storage and persistent disks. For object storage, GCP has Google Cloud Storage. GCP supports relational DBs through Google Cloud SQL. Technologies pioneered by Google, like BigQuery and Bigtable, are naturally fully supported, as is Hadoop. Google's Nearline offers archiving as cheap as Glacier, but with virtually no latency on recovery.

Azure uses temporary storage (D drive) and Page Blobs (Microsoft’s Block Storage option) for VM-based volumes. Block Blobs and Files serve for Object Storage. Azure supports both relational and NoSQL databases, and Big Data, through Windows Azure Table and HDInsight.

Comparison: Networking

Amazon's Virtual Private Clouds (VPCs) and Azure's Virtual Network (VNET) allow users to group VMs into isolated networks in the cloud. Using VPCs and VNETs, users can define a network topology and create subnets, route tables, private IP address ranges, and network gateways. There's not much to choose between AWS and Azure here: both have solutions to extend your on-premise data center into the public (or hybrid) cloud. Each Google Compute Engine instance belongs to a single network, which defines the address range and gateway address for all instances connected to it. Firewall rules can be applied to an instance, and it can receive a public IP address.

AWS is unique in providing Route 53, a DNS web service.

Comparison: Pricing Structure

AWS charges customers by rounding usage up to the hour, so the minimum billable use is one hour. AWS instances can be purchased using any one of three models:

  • on demand – customers pay for what they use without any upfront cost
  • reserved – customers reserve instances for 1 or 3 years with an upfront cost that is based on the utilization
  • spot – customers bid for the extra capacity available

GCP charges for instances by rounding usage up to the minute, with a minimum of 10 minutes. Google recently announced new sustained-use pricing for compute services, which offers a simpler and more flexible alternative to AWS's reserved instances: sustained-use pricing automatically discounts the on-demand baseline hourly rate as a particular instance is used for a larger percentage of the month.

Azure charges on-demand customers by rounding usage up to the minute. Azure also offers short-term commitments with discounts.
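To make the rounding rules concrete, here is a toy billing sketch; the rate and run lengths are illustrative, and real cloud pricing has many more dimensions:

```python
import math

def aws_on_demand_cost(minutes_used, cents_per_hour):
    """AWS-style billing: usage rounds up to whole hours (minimum one hour)."""
    hours_billed = max(1, math.ceil(minutes_used / 60))
    return hours_billed * cents_per_hour

def gcp_cost(minutes_used, cents_per_hour):
    """GCP-style billing: usage rounds up to whole minutes, 10-minute minimum."""
    minutes_billed = max(10, math.ceil(minutes_used))
    return minutes_billed * cents_per_hour / 60

# A 61-minute run at a hypothetical 60 cents/hour rate:
print(aws_on_demand_cost(61, 60))  # billed as 2 full hours -> 120 cents
print(gcp_cost(61, 60))            # billed as 61 minutes   -> 61.0 cents
print(gcp_cost(5, 60))             # 10-minute minimum      -> 10.0 cents
```

The difference matters most for short-lived or bursty workloads, where per-minute rounding avoids paying for most of an unused hour.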



One area where Google is particularly strong, though, is Big Data. It comes as no surprise that a company like Google could pour all of its experience in the area into making some excellent products, and BigQuery really is one. It allows you to analyze massive amounts of data in a very short time, even providing real-time insights about your datasets. And quite surprisingly for such a complex service, it's also easy to get started with.

I'm personally fond of AWS, and it would probably be my favourite choice, mostly because of the large variety of services available there. Nevertheless, I had a chance to play a little with GCE recently, and I've been impressed by the clean interface of the management console and by the bonus features I wrote about. I wouldn't suggest that anybody migrate away from AWS, but there may be many cases where GCP provides a better alternative. As always, it's all about what you need to do, how you want to do it, and what your limits and constraints are.


Elasticsearch 2.3 for Geo Location

Elasticsearch is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications.

Elasticsearch is able to achieve fast search responses because, instead of searching the text directly, it searches an index. This is like retrieving the pages in a book related to a keyword by scanning the index at the back of the book, as opposed to searching every word of every page. This type of index is called an inverted index, because it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages). Elasticsearch uses Apache Lucene to create and manage this inverted index.
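The inverted-index idea can be sketched in a few lines. This is a toy word-level index over illustrative documents; it ignores the tokenization, stemming, and scoring that Lucene actually performs:

```python
# Toy inverted index: maps each word to the set of document ids containing it,
# inverting the document -> words structure into word -> documents.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs bark",
}

index = {}
for doc_id, text in docs.items():
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add(doc_id)

def search(word):
    """Look the word up in the index instead of scanning every document."""
    return index.get(word.lower(), set())

print(sorted(search("quick")))  # [1, 3]
print(sorted(search("the")))    # [1, 2]
```

A lookup is a single dictionary access regardless of corpus size, which is why index-based search scales so much better than scanning the raw text.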

Elasticsearch for Geolocations

In this article we will keep our scope limited to the features provided by Elasticsearch for the geolocation data. The beauty of Elasticsearch is that it allows you to combine geolocation with full-text search, structured search, and analytics.

For instance: show me luxury cabs within a radius of 5 km that have the lowest fare, ranked by user rating. Another example: show me the nearest hospital that is open at 11 P.M. and has an emergency facility.

For dealing with geo data, Elasticsearch supports two data types:

  • Geo-points allow you to find points within a certain distance of another point, to calculate distances between two points for sorting or relevance scoring, or to aggregate into a grid to display on a map.
  • Geo-shapes, on the other hand, are used purely for filtering. They can be used to decide whether two shapes overlap, or whether one shape completely contains another shape.

Geo Point data type

Fields of type geo_point accept latitude-longitude pairs. There are four ways to specify a geo-point:

  1. Geo-point expressed as an object, with lat and lon keys.
  2. Geo-point expressed as a string with the format: "lat,lon".
  3. Geo-point expressed as a geohash.
  4. Geo-point expressed as an array with the format: [ lon, lat]

Refer Elasticsearch: Geo-Point document for more information related to Geo-Points.

Note: String geo-points are ordered as lat,lon, while array geo-points are ordered as the reverse: lon,lat.
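A small sketch of the formats and the ordering pitfall. The coordinates are illustrative, and the geohash shown is assumed to be an approximate encoding of the same point:

```python
# The same point in the accepted formats (note lat,lon vs lon,lat ordering).
as_object  = {"lat": 41.12, "lon": -71.34}   # 1. object with lat and lon keys
as_string  = "41.12,-71.34"                  # 2. string: "lat,lon"
as_geohash = "drm3btev3e86"                  # 3. geohash (approximate)
as_array   = [-71.34, 41.12]                 # 4. array: [lon, lat] -- reversed!

def to_lat_lon(point):
    """Normalize object, string, or array geo-points to a (lat, lon) tuple."""
    if isinstance(point, dict):
        return (point["lat"], point["lon"])
    if isinstance(point, str):
        lat, lon = point.split(",")
        return (float(lat), float(lon))
    lon, lat = point  # arrays are GeoJSON-style: longitude first
    return (lat, lon)

# All three directly-comparable formats describe the same point:
assert to_lat_lon(as_object) == to_lat_lon(as_string) == to_lat_lon(as_array)
```

Mixing up the string and array orderings is one of the most common geo-point bugs, which is why the note above is worth internalizing.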

Geo Shape data type

The geo_shape datatype facilitates indexing and searching of arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points. You can query documents of this type using the geo_shape query.

Refer Elasticsearch: Geo-Shapes document for detailed information.
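As a sketch, a geo_shape query body could look like the following Python dict. The `location` field name and the envelope coordinates are illustrative assumptions, not part of any real mapping:

```python
import json

# Sketch of a geo_shape query body (Elasticsearch query DSL). The "location"
# field name and the envelope corners are illustrative.
query = {
    "query": {
        "bool": {
            "filter": {
                "geo_shape": {
                    "location": {
                        "shape": {
                            "type": "envelope",
                            # envelope corners: [[min_lon, max_lat], [max_lon, min_lat]]
                            "coordinates": [[13.0, 53.0], [14.0, 52.0]],
                        },
                        "relation": "within",
                    }
                }
            }
        }
    }
}

# This JSON would be POSTed to the index's _search endpoint.
body = json.dumps(query)
print(len(body) > 0)
```

The `relation` value controls the spatial test applied (e.g. `within`, `intersects`), matching the overlap/containment decisions described above.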


Geohashes

Geohashes are a way of encoding lat/lon points as strings. The original intention was to have a URL-friendly way of specifying geolocations, but geohashes have turned out to be a useful way of indexing geo-points and geo-shapes in databases. Geohashes divide the world into a grid of 32 cells (4 rows and 8 columns), each represented by a letter or number; each of those cells is subdivided into 32 smaller cells in turn, so longer geohashes are more precise.

Refer Elasticsearch: Geohashes document for further information with examples.
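The encoding scheme can be sketched as a minimal geohash encoder that halves the longitude and latitude ranges in turn and interleaves the resulting bits. This is a sketch sufficient to reproduce the classic `ezs42` worked example, not a production implementation:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet

def geohash_encode(lat, lon, precision=5):
    """Encode a lat/lon pair as a geohash by repeatedly bisecting the
    longitude and latitude ranges and interleaving the resulting bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # even-numbered bits refine longitude, odd latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        even = not even
    # Every 5 bits selects one character from the 32-character alphabet.
    chars = []
    for i in range(0, len(bits), 5):
        value = int("".join(map(str, bits[i:i + 5])), 2)
        chars.append(BASE32[value])
    return "".join(chars)

print(geohash_encode(42.605, -5.603))  # ezs42 (the classic worked example)
```

Because each added character subdivides the previous cell, a geohash is a prefix of every longer geohash inside it, which is what makes geohashes so convenient for grid-based indexing.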

Geo Aggregations

Geo-aggregations can be used to cluster geo-points into more manageable buckets, for example to present information to the user on a map. Three aggregations work with fields of type geo_point:

  • geo_distance: Groups documents into concentric circles around a central point.
  • geohash_grid: Groups documents by geohash cell, for display on a map.
  • geo_bounds: Returns the lat/lon coordinates of a bounding box that would encompass all of the geo-points. This is useful for choosing the correct zoom level when displaying a map.
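The geo_distance idea can be sketched with a plain haversine computation and manual ring bucketing; the centre point and the points being bucketed below are illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km: mean Earth radius

# Bucket points into concentric rings around a centre, as geo_distance does.
centre = (51.5074, -0.1278)  # central London (illustrative)
points = {"nearby": (51.51, -0.12), "paris": (48.8566, 2.3522)}
rings = {"0-5km": [], "5km+": []}
for name, (lat, lon) in points.items():
    d = haversine_km(centre[0], centre[1], lat, lon)
    rings["0-5km" if d < 5 else "5km+"].append(name)

print(rings)  # {'0-5km': ['nearby'], '5km+': ['paris']}
```

This is the geometry behind both the geo_distance aggregation and the "cabs within 5 km" style of query mentioned earlier; Elasticsearch simply performs it at index scale.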

Check: MySQL, Logstash and Elasticsearch: Sync Realtime Geolocation Updates to learn more about syncing data that updates in real time from MySQL to Elasticsearch.


Databases at a Glance: Datastore Based Categorization

There are various databases available these days, but the choice depends on which database, or combination of databases, best solves your problem. Each database has its pros and cons, and one of the major factors deciding the kind of database to use is the kind of datastore you are dealing with.

Based on datastore, databases are mainly categorized into:

  1. Relational
  2. Key-Value
  3. Columnar
  4. Document-oriented
  5. Graph
  6. Time Series

While selecting a database, one should consider the following factors:

  • Usability: How user-friendly will the system be for the staff members who need to use it?
  • Visualisation & Reporting: Review the ease of visually analysing and displaying results for any queries you run on your data, while making selections and deciding segments.
  • Security: Security of your data is an essential aspect of any database implementation. Business-sensitive data and any personal information you hold must be stored securely to adhere to regulations and to protect it from loss or theft.
  • Functionality: Confirm that the modules available in the data analysis software meet your business requirements.
  • Support & Development: Think about the support service the software company offers for its solution. Is this available during the hours you are likely to need support? Is the support offered by email, phone, other?
  • Integration: Does the system you are considering integrate with your other software systems such as your Email Marketing platform and CRM system?
  • Scalability: Ensure that the system has the capacity to grow with your data and your business.
  • Cost and Suitability: Whilst cost is obviously a factor in any business expenditure, it is wise to ensure that – as far as possible – your decision is based on the software being fit for purpose.
  • Hosting: Where is your system going to be located (physically or cloud)? What will be its downtime and maintenance window?

In this article we will keep our scope limited to the kind of database to be selected based on the datastore.

Relational Databases

Relational database management systems (RDBMSs) are set-theory-based systems implemented as two-dimensional tables with rows and columns.

An RDBMS stores data in a collection of tables, which may be related by common fields (table columns), and provides relational operators to manipulate the data stored in those tables. Most RDBMSs use SQL as the database query language. Importantly, tables can be joined and morphed into new, more complex tables, because of their mathematical basis in relational (set) theory.
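A minimal sketch of these ideas using SQLite (one of the RDBMSs listed below) as an in-process example; the tables and data are illustrative:

```python
import sqlite3

# Two related tables joined on a common field: the essence of the model.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT,
                          author_id INTEGER REFERENCES authors(id));
    INSERT INTO authors VALUES (1, 'Codd'), (2, 'Date');
    INSERT INTO books   VALUES (1, 'A Relational Model', 1),
                               (2, 'An Introduction', 2);
""")

# The JOIN derives a new table from the two base tables.
rows = con.execute("""
    SELECT a.name, b.title
    FROM books AS b JOIN authors AS a ON a.id = b.author_id
    ORDER BY a.name
""").fetchall()
print(rows)  # [('Codd', 'A Relational Model'), ('Date', 'An Introduction')]
```

The same SQL would run essentially unchanged on MySQL or PostgreSQL, which is the portability benefit of the shared relational model.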

There are lots of relational databases to choose from, like:

  • MySQL (open-source)
  • PostgreSQL (open-source)
  • Oracle (enterprise)
  • Microsoft SQL Server (enterprise)
  • H2 (open-source)
  • HSQLDB (open-source)
  • SQLite (open-source)

To view the list of all relational databases available, check: List of relational database management systems.

Key-Value Databases

Key-value stores are probably the simplest form of database management systems. They can only store pairs of keys and values, as well as retrieve values when a key is known.

These simple systems are normally not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores are often used in embedded systems or as high-performance in-process databases.
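A minimal sketch of the set/get/delete surface such systems expose. This is an in-process toy for illustration, not a stand-in for Redis or Memcached:

```python
# A minimal in-process key-value store: the entire API is set/get/delete,
# which is exactly the simplicity that makes such systems fast.
class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("session:42", {"user": "alice"})   # the value is opaque to the store
print(store.get("session:42"))               # {'user': 'alice'}
store.delete("session:42")
print(store.get("session:42", "miss"))       # miss
```

Because the store never interprets the value, lookups are a single hash operation; the trade-off is that you can only ever query by key.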

A few examples of key-value databases are:

  • Redis
  • Memcached
  • Amazon DynamoDB
  • Riak KV
  • Hazelcast

Visit the DB-Engines Ranking of Key-value Stores to get a complete list of key-value based databases.

Columnar Databases

A columnar database, also known as a column-oriented database, is a database management system (DBMS) that stores data in columns rather than in rows, as relational DBMSs do. The main differences between a columnar database and a traditional row-oriented database center around performance, storage requirements, and schema-modification techniques. The goal of a columnar database is to efficiently write and read data to and from hard disk storage and to speed up the time it takes to return a query.

One of the main benefits of a columnar database is that data can be highly compressed, allowing columnar operations like MIN, MAX, SUM, COUNT, and AVG to be performed very rapidly. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.
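A toy sketch of the row-wise versus column-wise layouts (the data is illustrative), showing why a columnar aggregate touches less data:

```python
# The same table stored row-wise and column-wise. With the columnar layout,
# an aggregate reads one contiguous list instead of every whole record.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]

columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}

# Row store: scan every record to aggregate one attribute.
row_total = sum(r["sales"] for r in rows)

# Column store: the aggregate touches only the relevant column.
col_total = sum(columns["sales"])
col_max = max(columns["sales"])

print(row_total, col_total, col_max)  # 525 525 250
```

Columnar layouts also compress well because each column holds values of one type, often with long runs of similar values, which is where the compression benefit mentioned above comes from.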

Some of the commonly known columnar databases are:

  • Google's Bigtable
  • HBase
  • Cassandra
  • Amazon Redshift

To get the list of all the columnar databases, visit: List of column-oriented DBMSes.

Document-oriented Databases

Document-oriented databases are inherently a subclass of the key-value store. The difference lies in the way the data is processed: in a key-value store the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems, conceptually the document store is designed to offer a richer experience with modern programming techniques. XML databases are a specific subclass of document-oriented databases that are optimized to extract their metadata from XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.
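A toy sketch of the distinction: the store below inspects document structure to build a secondary index and to query on internal fields, which a pure key-value store cannot do. The documents and field names are illustrative:

```python
# Unlike an opaque key-value store, a document store can read the document's
# internal structure, so it can index and query on fields inside it.
documents = [
    {"_id": 1, "type": "article", "author": "amit", "tags": ["sql", "nosql"]},
    {"_id": 2, "type": "article", "author": "neha", "tags": ["nosql"]},
    {"_id": 3, "type": "comment", "author": "amit", "tags": []},
]

# A secondary index extracted from the documents' own structure.
by_author = {}
for doc in documents:
    by_author.setdefault(doc["author"], []).append(doc["_id"])

def find(field, value):
    """Query on any internal field, not just the primary key."""
    return [d["_id"] for d in documents
            if value == d.get(field)
            or (isinstance(d.get(field), list) and value in d[field])]

print(by_author["amit"])      # [1, 3]
print(find("tags", "nosql"))  # [1, 2]
```

Real document stores such as MongoDB and Elasticsearch maintain such secondary indexes automatically, which is the optimization the paragraph above refers to.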

A few of the document-oriented databases are:

  • CouchDB
  • MongoDB
  • Solr
  • Elasticsearch

A complete list of the available document-oriented databases is available at: Document-oriented database: Implementations.

Graph Databases

A graph database, also called a graph-oriented database, is a type of NoSQL database that uses graph theory to store, map and query relationships.

A graph database is essentially a collection of nodes and edges. Each node represents an entity (such as a person or business) and each edge represents a connection or relationship between two nodes. Every node in a graph database is defined by a unique identifier, a set of outgoing edges and/or incoming edges and a set of properties expressed as key/value pairs. Each edge is defined by a unique identifier, a starting-place and/or ending-place node and a set of properties.
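The node/edge structure just described can be sketched as follows; the nodes, relationship names, and traversal below are illustrative:

```python
from collections import deque

# Nodes with properties, and edges carrying a relationship type.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "acme": {"kind": "business"},
}
edges = [
    ("alice", "knows", "bob"),
    ("bob", "works_at", "acme"),
]

# Adjacency list built from the outgoing edges of each node.
adjacency = {}
for src, rel, dst in edges:
    adjacency.setdefault(src, []).append(dst)

def reachable(start):
    """All nodes reachable from `start` by following outgoing edges (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(reachable("alice"))  # {'bob', 'acme'}
```

Traversals like this are the core operation graph databases optimize: following relationships is a constant-time hop rather than the repeated joins a relational schema would need.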

Graph databases are well-suited for analyzing interconnections, which is why there has been a lot of interest in using graph databases to mine data from social media. Graph databases are also useful for working with data in business disciplines that involve complex relationships and dynamic schemas, such as supply chain management or identifying the source of an IP telephony issue.

Some of the most highly rated Graph databases are:

  • Neo4j
  • OrientDB
  • Titan
  • Virtuoso
  • ArangoDB

To see the ranking of all the Graph databases, check: DB-Engines Ranking of Graph DBMS.

Time Series Databases

Time Series Databases (TSDBs) are databases optimized for time series data. Software with complex logic or business rules and a high transaction volume for time series data may not be practical with traditional relational database management systems. Flat-file databases are not a viable option either, once the data and transaction volume reaches a threshold determined by the capacity of individual servers (processing power and storage capacity). Queries over historical data, replete with time ranges, roll-ups, and arbitrary time zone conversions, are difficult in a relational database.

A TSDB allows users to create, enumerate, update, and destroy various time series and organize them in some fashion. These series may be organized hierarchically and optionally have companion metadata available with them. The server often supports a number of basic calculations that work on a series as a whole, such as multiplying, adding, or otherwise combining various time series into a new time series. It can also filter on arbitrary patterns defined by the day of the week, low-value filters, high-value filters, or even have the values of one series filter another.
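A sketch of two of those server-side operations, a time-range roll-up and a point-by-point series combination, on illustrative data:

```python
from collections import defaultdict

# A series as (timestamp, value) pairs; illustrative metric samples.
cpu = [("2017-05-01T00", 40), ("2017-05-01T12", 60), ("2017-05-02T06", 20)]
mem = [("2017-05-01T00", 10), ("2017-05-01T12", 20), ("2017-05-02T06", 30)]

def rollup_daily_avg(series):
    """Roll intraday points up into daily averages."""
    buckets = defaultdict(list)
    for ts, value in series:
        buckets[ts[:10]].append(value)  # truncate the timestamp to the day
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}

def add_series(a, b):
    """Combine two aligned series into a new series, point by point."""
    return [(ts, va + vb) for (ts, va), (_, vb) in zip(a, b)]

print(rollup_daily_avg(cpu))  # {'2017-05-01': 50.0, '2017-05-02': 20.0}
print(add_series(cpu, mem))
```

Production TSDBs such as InfluxDB perform these roll-ups and combinations as built-in query functions over compressed, time-partitioned storage rather than in application code.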

The most widely used time series databases are:

  • InfluxDB
  • KairosDB
  • OpenTSDB

Visit: Time series database: Example TSDB Systems to get the complete list of time-series databases.


What is the next BIG thing in technology?

There's not a single aspect of the human experience that hasn't been touched by technology. Everything from industry, to medicine, to how we work has been fundamentally reshaped by the technologies which emerged in the second half of the 20th century. At the start of the year 2000, very few people had access to the internet. But thanks to advancements in technology and the telecom sector, almost everyone now has access to the internet. If you are reading this, you are already a part of this community.

Have you ever wondered what our future will look like? There are a lot of game-changing developments happening at this very moment that may change the future of mankind. Here are some of the most promising new ideas that could change the world in the upcoming years.

1. Artificial Intelligence

AI or artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using the rules to reach approximate or definite conclusions), and self-correction. We have artificial intelligence like Apple’s Siri, Amazon’s Alexa, and Google’s Assistant. Whenever the big players in tech are competing with one another on something like AI, it could become the next big thing.

There have been a lot of advancements in this domain since Google open-sourced its AI engine, TensorFlow. The time is not too far off when we'll see a live example of J.A.R.V.I.S. from the Iron Man movies. (J.A.R.V.I.S. stands for Just A Rather Very Intelligent System.)

2. Data Analytics and Prediction

Predictive analytics is an area of statistics that deals with extracting information from data and using it to predict trends and behavior patterns. It enables us to make data-driven decisions and to spot deviations from the predicted path. Whether it is predicting your age based on your daily lifestyle, climate change or depletion of the ozone layer due to global warming, or the opening of a warehouse in a new geography, it can help mankind in a lot of ways.
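As a minimal sketch of the idea, here is an ordinary least-squares trend line fitted to illustrative monthly figures and extrapolated one step ahead; real predictive analytics involves far richer models than a straight line:

```python
# Fit a straight-line trend with ordinary least squares, then extrapolate:
# the simplest form of predicting a trend from past observations.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Monthly sales (illustrative numbers) growing exactly as y = 2x + 9.
months = [1, 2, 3, 4]
sales = [11, 13, 15, 17]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predict month 5
print(slope, intercept, forecast)  # 2.0 9.0 19.0
```

Tools like Prophet automate this kind of fitting at much greater sophistication, handling seasonality, holidays, and trend changes, but the underlying idea of fitting past data and extrapolating is the same.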

Wouldn't it be great if you had the ability to predict the impact of your each and every decision, and to plan your path accordingly? There are a lot of innovations happening in this area; one example is Prophet, a forecasting tool open-sourced by Facebook on Feb 23, 2017. I am expecting more advancements in this sector in the near future.

3. Space Travel

You must have seen space in sci-fi movies and wondered about going out there. Every time you see a rocket launch, thunderously clawing its way above the Earth's atmosphere, it represents an expenditure of tens or even hundreds of millions of dollars. Until recently, rockets were single-use objects. But March 30, 2017 was a historic day, when SpaceX successfully reused a rocket to launch a satellite into space.

Elon Musk, who founded SpaceX in 2002, has long been a proponent of colonizing Mars, and he has used his considerable wealth to develop reusable rocket technology. I see this as a major advancement that increases the likelihood of space travel for ordinary people.

4. Virtual Reality

Virtual reality is an artificial environment created with software and presented to the user in such a way that the user suspends disbelief and accepts it as a real environment. It allows gamers to physically look around worlds of wonder, tourists to take a simulated tour of the moon, and could even take videoconferencing to the next level; the potential applications are endless.

It's too early to tell, but Mark Zuckerberg believes in the concept so strongly that Facebook paid $2 billion to buy Oculus, the maker of the Oculus Rift headset. Mark announced this acquisition in a Facebook post on March 26, 2014.

5. Augmented Reality

Augmented reality is the integration of digital information with the user’s environment in real time. Unlike virtual reality, which creates a totally artificial environment, augmented reality uses the existing environment and overlays new information on top of it.

While companies like Facebook and Sony are busying themselves with virtual reality, other companies like Microsoft and Google want to bring digital images into the real world. With an augmented reality headset like Microsoft’s HoloLens, you can still see the world around you, but it’s filled with digital images that look like they’re right in front of you. It’s reality, but better.

Potential applications for augmented reality include education, skill-based training, advertising, and gaming. If augmented reality takes off, you may be able to see whatever kind of information you want overlaid across your vision. That could be useful in any number of ways.


Apache Kafka v/s RabbitMQ – Message Queue Comparison

Message Queue: Overview

The Message-Oriented Middleware (MOM) concept involves exchanging data between different applications using messages, asynchronously. Using this mechanism, applications are decoupled: senders and receivers exist without knowledge of each other, and it becomes the responsibility of the messaging system (the message-oriented middleware) to transfer messages between applications. Queues allow you to store metadata for processing jobs at a later date. They can aid in the development of a service-oriented architecture (SOA) by providing the flexibility to defer tasks to separate processes. When applied correctly, queues can dramatically improve the user experience of a website by reducing load times.

Use cases of Message Queue

  • Asynchronous: Often you don’t want, or need, to process a message immediately. Message queues enable asynchronous processing: put a message on the queue without processing it right away, queue up as many messages as you like, then process them at your leisure.
  • Decoupling: It’s extremely difficult to predict, at the start of a project, what the future needs of the project will be. By introducing a layer in between processes, message queues create an implicit, data-based interface that both processes implement. This allows you to extend and modify these processes independently, by simply ensuring they adhere to the same interface requirements.
  • Resilience: When part of your architecture fails, it doesn’t need to take the entire system down with it. Message queues decouple processes, so if a process that is processing messages from the queue fails, messages can still be added to the queue to be processed when the system recovers. This ability to accept requests that will be retried or processed at a later date is often the difference between an inconvenienced customer and a frustrated customer.
  • Redundancy: Sometimes processes fail when processing data. Unless that data is persisted, it’s lost forever. Queues mitigate this by persisting data until it has been fully processed. The put-get-delete paradigm, which many message queues use, requires a process to explicitly indicate that it has finished processing a message before the message is removed from the queue, ensuring your data is kept safe until you’re done with it.
  • Delivery Guarantees: The redundancy provided by message queues guarantees that a message will be processed eventually, so long as a process is reading the queue. No matter how many processes are pulling data from the queue, each message will only be processed a single time. This is made possible because retrieving a message “reserves” that message, temporarily removing it from the queue. Unless the client specifically states that it’s finished with that message, the message will be placed back on the queue to be processed after a configurable amount of time.
  • Ordering Guarantees: In a lot of situations, the order with which data is processed is important. Message queues are inherently ordered, and capable of providing guarantees that data will be processed in a specific order.
  • Scalability: Because message queues decouple your processes, it’s easy to scale up the rate with which messages are added to the queue or processed; simply add another process. No code needs to be changed, no configurations need to be tweaked. Scaling is as simple as adding more power.
  • Elasticity & Spikability: When your application hits the front page of Hacker News, you’re going to see unusual levels of traffic. Your application needs to be able to keep functioning with this increased load, but the traffic is an anomaly, not the norm; it’s wasteful to have enough resources on standby to handle these spikes. Message queues will allow beleaguered components to struggle through the increased load, instead of getting overloaded with requests and failing completely.
  • Buffering: In any non-trivial system, there are going to be components that require different processing times. For example, it takes less time to upload an image than it does to apply a filter to it. Message queues help these tasks operate at peak efficiency by offering a buffer layer: the process writing to the queue can write as fast as it’s able to, instead of being constrained by the readiness of the process reading from the queue. This buffer helps control and optimize the speed with which data flows through your system.
  • Understanding Data Flow: In a distributed system, getting an overall sense of how long user actions take to complete and why is a huge problem. Message queues, through the rate with which they are processed, help to easily identify under-performing processes or areas where the data flow is not optimal.
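The Redundancy and Delivery Guarantees points above both rest on the put-get-delete paradigm: a retrieved message is only *reserved*, and it returns to the queue unless it is explicitly deleted before a visibility timeout expires. Here is a toy in-process model of that behaviour — an assumption-laden sketch, not any real queue's implementation (the class and timeout values are invented for illustration):

```python
import time

class ToyQueue:
    """Toy put-get-delete queue: a received message is only 'reserved';
    it reappears on the queue unless it is deleted (acked) in time."""
    def __init__(self, visibility_timeout=0.1):
        self.timeout = visibility_timeout
        self.ready = []        # messages available to consumers
        self.reserved = {}     # msg_id -> (body, reserved_at)
        self._next_id = 0

    def put(self, body):
        self.ready.append((self._next_id, body))
        self._next_id += 1

    def get(self):
        self._requeue_expired()
        if not self.ready:
            return None
        msg_id, body = self.ready.pop(0)
        self.reserved[msg_id] = (body, time.monotonic())
        return msg_id, body

    def delete(self, msg_id):
        self.reserved.pop(msg_id, None)   # ack: now it is really gone

    def _requeue_expired(self):
        now = time.monotonic()
        for msg_id, (body, at) in list(self.reserved.items()):
            if now - at > self.timeout:
                del self.reserved[msg_id]
                self.ready.append((msg_id, body))

q = ToyQueue(visibility_timeout=0.05)
q.put("charge card")

msg_id, body = q.get()        # consumer crashes before deleting...
time.sleep(0.06)              # ...and the visibility timeout expires

msg_id2, body2 = q.get()      # the same message is delivered again
q.delete(msg_id2)             # this time we ack it
```

Because the delete is explicit, a crashed consumer loses nothing: the undeleted message simply becomes visible again for the next reader.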


RabbitMQ is one of the leading open-source messaging systems. It is written in Erlang, implements AMQP, and is a very popular choice when messaging is involved. It supports both message persistence and replication, with well-documented behaviour in cases such as network partitions.

We’ll be testing a 3-node Rabbit cluster. To be sure that sends complete successfully, we’ll be using publisher confirms, a Rabbit extension to AMQP. The confirmations are cluster-wide, so this gives us pretty strong guarantees: that messages will be both written to disk, and replicated to the cluster (see the docs).

Such strong guarantees are probably one of the reasons for the relatively poor performance: a single-threaded, single-node setup gives us 310 msgs/s sent and received. This scales nicely as we add nodes, up to 1,600 msgs/s.
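Why do cluster-wide confirms cost throughput? A confirm can only be returned to the publisher after the message has been persisted on every node, so each publish waits on the slowest replica. The toy model below illustrates that semantics only — it is an invented sketch (class and node names are made up), not RabbitMQ code:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.disk = []        # stand-in for the node's persistent store

    def persist(self, message):
        self.disk.append(message)   # pretend this is an fsync'd write
        return True

class ConfirmingCluster:
    """Toy model of cluster-wide publisher confirms: publish() only
    returns True once every node has persisted the message."""
    def __init__(self, nodes):
        self.nodes = nodes

    def publish(self, message):
        acks = [node.persist(message) for node in self.nodes]
        return all(acks)      # the confirm the publisher waits for

cluster = ConfirmingCluster([Node("rabbit1"), Node("rabbit2"), Node("rabbit3")])
confirmed = cluster.publish("order-42")
print(confirmed)   # True only after all three nodes persisted the message
```

Every extra replica adds write-and-ack latency to the publish path, which is exactly the durability-for-throughput trade the benchmark numbers reflect.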



Kafka takes a different approach to messaging. The server itself is a streaming publish-subscribe system. Each Kafka topic can have multiple partitions; by using more partitions, the consumers of the messages (and hence the throughput) can be scaled out and the concurrency of processing increased.

On top of publish-subscribe with partitions, a point-to-point messaging system is built by putting a significant amount of logic into the consumers; in the other messaging systems we’ve looked at, it was the server that contained most of the message-consumed-by-one-consumer logic, whereas here it is the consumer.

Each consumer in a consumer group reads messages from a number of dedicated partitions; hence it doesn’t make sense to have more consumer threads than partitions. Messages aren’t acknowledged on the server (a very important design difference!); instead, the message offsets processed by consumers are written to ZooKeeper, either automatically in the background or manually. This allows Kafka to achieve much better performance.
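The consumer-group model described above can be sketched in plain Python. This is a static toy model of the semantics (the partition contents, consumer names, and round-robin assignment are invented for illustration; real Kafka assigns partitions dynamically via a group coordinator): each partition is an ordered log, each consumer owns a disjoint set of partitions, and progress is a per-partition offset rather than per-message server-side acks.

```python
# Each partition is an append-only, ordered log of messages.
partitions = {
    0: ["p0-m0", "p0-m1", "p0-m2"],
    1: ["p1-m0", "p1-m1"],
    2: ["p2-m0"],
}
consumers = ["consumer-a", "consumer-b"]

# Round-robin assignment: partitions are split among group members, so
# more consumer threads than partitions would just leave consumers idle.
assignment = {c: [] for c in consumers}
for i, partition in enumerate(sorted(partitions)):
    assignment[consumers[i % len(consumers)]].append(partition)

offsets = {p: 0 for p in partitions}   # committed positions, not acks
consumed = {c: [] for c in consumers}

for consumer, owned in assignment.items():
    for p in owned:
        log = partitions[p]
        while offsets[p] < len(log):
            consumed[consumer].append(log[offsets[p]])
            offsets[p] += 1            # "commit" by advancing the offset

print(assignment)   # {'consumer-a': [0, 2], 'consumer-b': [1]}
print(offsets)      # {0: 3, 1: 2, 2: 1}
```

Note that ordering is preserved within each partition but not across partitions — which is exactly the "partitioned order" guarantee discussed below.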

Kafka v/s RabbitMQ v/s Other Queuing Service

As always, which message queue you choose depends on specific project requirements:

  • SQS is a service, so especially if you are using the AWS cloud it’s an easy choice: good performance and no setup required.
  • If you are using Mongo, it is easy to build a replicated message queue on top of it, without the need to create and maintain a separate messaging cluster.
  • HornetQ has great performance, with a very rich messaging interface and routing options.
  • If you want high persistence guarantees, RabbitMQ ensures replication across the cluster and on disk on message send.
  • Kafka offers the best performance and scalability.

When looking only at the throughput, Kafka is a clear winner (unless we include SQS with multiple nodes, but as mentioned, that would be unfair):

It is also interesting to see how sending more messages in a batch improves the throughput. As already mentioned, when increasing the batch size from 10 to 100, Rabbit gets a 2x speedup, HornetQ a 1.2x speedup, and Kafka a 2.5x speedup, achieving about 89,000 msgs/s!
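The intuition behind those batching speedups is that every send carries a fixed per-round-trip overhead, which a batch amortizes across many messages. The arithmetic can be sketched with a toy cost model — the overhead and per-message figures here are invented round numbers for illustration, not measurements from the benchmark:

```python
# Toy cost model: sending N messages in batches of B needs ceil(N / B)
# round trips instead of N, so the fixed per-trip overhead is amortized.
def send_time_us(n_messages, batch_size, round_trip_us=1000, per_msg_us=10):
    batches = -(-n_messages // batch_size)   # ceiling division
    return batches * round_trip_us + n_messages * per_msg_us

t10  = send_time_us(1000, batch_size=10)    # 100 trips -> 110,000 us
t100 = send_time_us(1000, batch_size=100)   #  10 trips ->  20,000 us
print(t10 / t100)   # 5.5: larger batches amortize the per-trip overhead
```

In this toy model the speedup is larger than the measured 2x–2.5x because real brokers also pay per-message costs (serialization, replication, disk writes) that batching cannot remove.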


Use Kafka if you have a fire hose of events (100k+/sec) you need delivered in partitioned order ‘at least once’ with a mix of online and batch consumers, you want to be able to re-read messages, you can deal with current limitations around node-level HA (or can use trunk code), and/or you don’t mind supporting incubator-level software yourself via forums/IRC.

Use RabbitMQ if you have messages (20k+/sec) that need to be routed in complex ways to consumers, you want per-message delivery guarantees, you don’t care about ordered delivery, you need HA at the cluster-node level now, and/or you need 24×7 paid support in addition to forums/IRC.

Both have similar distributed deployment goals but they differ on message model semantics.

Neither offers great “filter/query” capabilities.