Scaling Regression Test Cases through parallelism – A Cloud Native Approach


In the game of balancing agility, quality, and velocity amid continuous business change, the automated regression suite keeps growing along with your code base. A single line of change in business logic can result in multiple new regression test cases, and the count multiplies further because each case has to be tested on multiple devices and interfaces.

Engineering teams spend a lot of time making sure production systems scale and defined SLOs are met, while regression execution time keeps increasing and quietly takes a backseat, until someone realizes "Oh, it is taking half a day to run all the test cases", or sometimes even more. This is where your release velocity goes for a toss: any rerun or fix needs another iteration of the same duration.

We went through the same phase: our regression execution time grew multifold and started touching 12-15 hours to complete all the test cases on desktops, simulators, and multiple browsers. (How we scale on real mobile devices is a topic for another blog post.)

We were able to reduce regression test case execution time by 75% using parallelism through a cloud native approach, and we made regression runs measurable through real-time analytics for quick recovery and replay of test cases. Let's dig into the details here.

Problem Statement

HealthKart is a powerhouse of brands: all our brand websites (Muscleblaze.com, HKVitals.com, TrueBasics.com, Bgreen.com, Gritzo.com) and the HealthKart.com marketplace run on a single platform. A single change in the core platform requires thousands of regression test cases to be run across different platforms and devices.

We have around 3000 cases which take a full day to execute, and if any bug comes up in a release during regression, the same amount of time is taken to start over again.

Secondly, to get the failure report and rerun the failed test cases, we had to wait the whole day, because the report was generated only after all test cases had finished executing.

Approaches for the solution

  1. Selenium Grid: Selenium Grid is a smart proxy server that makes it easy to run tests in parallel on multiple machines. It does this by routing commands to remote web browser instances, with one server acting as the hub. The hub routes test commands, in JSON format, to multiple registered Grid nodes.

The two major components of the Selenium Grid architecture are:

  • Hub: a server that accepts access requests from the WebDriver client, routing the JSON test commands to the remote drivers on the nodes. It takes instructions from the client and executes them remotely on the various nodes in parallel.
  • Node: a remote device with a native OS and a remote WebDriver. It receives requests from the hub in the form of JSON test commands and executes them using WebDriver.

Features :

  1. Parallel test execution (local and cloud-based)
  2. Easy, seamless integration with existing Selenium code
  3. Multi-operating-system support

Cons :

  1. We have to keep the nodes running on our own managed VM machines, and node failures can provoke a full stop of test execution.
  2. Session caches can create problems.
  3. Challenges may emerge if multiple browsers run on the same machine, since we depend on that machine's resources.

2. Selenoid: Selenoid is a robust implementation of the Selenium hub that uses Docker containers to launch browsers, giving a fully isolated and reproducible environment.

Selenoid can launch an unlimited number of browsers and browser versions concurrently.

Features :

  1. We don't have to maintain running nodes.
  2. Containers provide enough isolation between browser processes, so session caching is not a problem here.
  3. Real browsers are available in all versions.
  4. Easy integration with existing Selenium code.
  5. A Docker container is launched on the fly when a test starts and destroyed when the test finishes.

Cons: The community around the solution is quite small.

3. BrowserStack: BrowserStack is a third-party tool that runs your UI test suite in minutes with parallelization on a real browser and device cloud. You can test on every commit without slowing down releases, and catch bugs early.

Features :

  1. Real devices and browsers are available in all versions.
  2. We can run as many parallel tests as our plan allows.
  3. Seamless integration with existing code.

Cons: Costly implementation; the price grows steeply with more parallel sessions.

Solution we implemented

  • We chose Selenoid due to its ease of operability and its cloud native approach of achieving parallelism in a distributed environment through Docker containers. Since we use different cloud providers in production vs. dev, the cloud native approach was a life saver for us.
  • With parallel execution of test cases through Selenoid, we were able to bring the execution time down to 3-4 hours for 3000 cases. If any branch gets re-merged due to a defect found, the fix can now be planned and released within a couple of hours.
  • Since tests were executing in a distributed environment, we needed a log aggregation service that could be easily hooked into the solution architecture, so we integrated the ELK stack for monitoring and analytics of test cases. With it we were able to monitor, control, and find problems with test cases in real time.

Implementation of Selenoid

Selenoid is an open source project written in Golang. It is an implementation of Selenium hub that uses Docker containers to launch browsers. A new container is created for each test and removed when the test ends, which lets tests run in parallel. Selenoid also ships a UI that gives a clear picture of the running test cases and the remaining capacity.

Scaling the number of tests that run in parallel is a single configuration change, bounded only by the capacity of the VM where Docker is launched.
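The effect of that configuration can be sketched with a plain thread pool. This is a hedged sketch, not our actual runner: in the real setup each task would create a RemoteWebDriver against the Selenoid endpoint, and the pool size corresponds to Selenoid's parallel-session limit; all names and numbers here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSuiteSketch {
    public static void main(String[] args) throws Exception {
        int tests = 12;          // stand-in for the regression suite size
        int parallelism = 4;     // plays the role of Selenoid's session limit
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);

        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < tests; i++) {
            int id = i;
            // In the real suite this task would create a RemoteWebDriver
            // against the Selenoid hub URL and run one test case.
            results.add(pool.submit(() -> {
                Thread.sleep(100); // simulated test duration
                return "test-" + id + ":PASSED";
            }));
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // collect results in submission order
        }
        pool.shutdown();
    }
}
```

For an embarrassingly parallel suite, raising the parallelism value cuts wall-clock time roughly proportionally, which is exactly the 12-hours-to-3-hours effect described in this post.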

Implementation of ELK

The ELK stack (Elasticsearch, Logstash, Kibana) gives you the ability to ship logs in JSON format and visualise that data through Kibana. With the ELK implementation we get real-time data on test case failures, along with the reason, making it actionable to rerun them from the console if required.

What we achieved

  1. We were able to bring down our regression test case execution time by 75% (from 12 hours to 3 hours). This has boosted agility and velocity in the system at a larger scale.
  2. Measurable: with the ELK implementation we get real-time data on test case failures with reasons, making it actionable to rerun them from the console if required. This is again a step ahead in the agility and velocity of the system.

Benefits of the Implementation, beyond cost saving

  1. Agility: the system is more agile and adaptive to change.
  2. Velocity: changes can be made at a faster speed.
  3. Ease of scalability: the structure is highly scalable; to increase the number of tests executing in parallel, we just increase the value of the parallelism configuration.
  4. Reliability: the real-time analytics dashboard gives greater control in finding the cause and replaying/fixing it faster, which makes the system more reliable and adaptive.

Value Addition (Take Away from this)

Adding a parallel execution tool to our release cycle gives a clear picture of the current test cases by recording videos of individual cases, which makes debugging easier when bugs appear. Secondly, scaling the parallel execution is very easy, which makes a tester's life easier.

The above content is an outcome of our experience while dealing with the above problem statement. Please feel free to comment with your own experience.

Photo by Taylor Vick on Unsplash

Fixing MySQL errors (Lockwait timeout/ Deadlocks) In High Concurrent Spring Boot Transactional Systems

Nearly every engineer working with relational database management systems has encountered deadlocks or Lockwait Timeouts, or let’s be honest, been haunted by the nightmare of them.

HealthKart encountered a similar issue. Back-to-back deadlocks hampered our user experience, especially during sale seasons, owing to the high concurrency. This kept us up all night, leaving us begging even for a coffee break.

There are numerous blogs that help in understanding what deadlocks and lock wait timeouts actually are, and they offer solutions to either avoid the issue or minimize it.
For example:

  • Make changes to the table schema, such as removing foreign key constraints to detach two tables, or adding indexes to minimize the rows scanned and locked.
  • Keep transactions small and short in duration to make them less prone to collision.
  • When modifying multiple tables within a transaction, or different sets of rows in the same table, do those operations in a consistent order each time. Then transactions form well-defined queues and do not deadlock.
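The last point, consistent lock ordering, can be illustrated with plain JVM locks. This is a minimal sketch, not database code: the two ReentrantLocks stand in for row locks on rows 1 and 2.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingDemo {
    static final ReentrantLock lockA = new ReentrantLock(); // stands in for the lock on row 1
    static final ReentrantLock lockB = new ReentrantLock(); // stands in for the lock on row 2
    static int row1 = 0, row2 = 0;

    // Both "transactions" touch both rows, but always acquire the locks in the
    // same global order: A before B. Reversing the order in one thread is
    // exactly what produces a deadlock.
    static void updateBoth(int delta) {
        lockA.lock();
        try {
            lockB.lock();
            try {
                row1 += delta;
                row2 += delta;
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> updateBoth(1));
        Thread t2 = new Thread(() -> updateBoth(1));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Both threads finished: with a consistent order they queue, never deadlock.
        System.out.println("row1=" + row1 + " row2=" + row2);
    }
}
```

If one of the threads acquired lockB before lockA instead, the two could block each other forever; a single global acquisition order makes them queue instead.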

Such solutions can be applied to small applications with relatively light data entries in databases or applications that are being made from scratch.

But these solutions seem infeasible and difficult for an application like HealthKart, which has large backend services with numerous REST APIs, relatively long transaction blocks, numerous concurrent users, and relatively large data in its databases.

Breaking up the transaction blocks in heavy backend services without knowing the culprit transactions, or altering the heavy database tables, was practically impossible for us.

So it was clear that merely minimizing or avoiding the deadlocks would only make the monster more powerful. We had to figure out the crucial step that lies between understanding them and resolving them: identifying the root cause, i.e. the culprit transaction blocks participating in a deadlock or lock wait timeout.

Problem Statement

Error Logged in Application Log

We came across the following error in our application logs.

2023-05-03 15:05:36.463 ERROR [demo,be80487cd442ff4e,b9e94dec7cb710f0,false, ] 13787 --- [io-8080-exec-45] .engine.jdbc.spi.SqlExceptionHelper(142) : Deadlock found when trying to get lock; try restarting transaction

2023-05-03 15:05:36.462 ERROR [demo,be80487cd442ff4e,be80487cd442ff4e,false, ] 10384 --- [io-8080-exec-45] .hk.rest.resource.cart.CartResource(361) : CRITICAL ERROR - org.springframework.dao.CannotAcquireLockException: could not execute statement; SQL [n/a]; nested exception is org.hibernate.exception.LockAcquisitionException: could not execute statement

The above error caused the API to terminate with a status code of 500, resulting in a poor user experience. With the help of application logs, we were able to identify the API in which the deadlock occurred, but we failed to understand which queries were involved in it. We had to dig deeper.

Error Logged in MySQL Error Log File

We had the following MySQL error log at our disposal

2023-05-03 15:04:36 7f3e95dfd700


TRANSACTION 46429732, ACTIVE 22 sec starting index read

mysql tables in use 1, locked 1

LOCK WAIT 14 lock struct(s), heap size 2936, 15 row lock(s), undo log entries 2

MySQL thread id 1043926, OS thread handle 0x7f3e9ad35700, query id 29356871 updating

UPDATE test_user_cart SET value_addition = 1 WHERE id = 34699


RECORD LOCKS space id 4467 page no 68318 n bits 272 index `PRIMARY` of table `test_user_cart` trx id 46429732 lock_mode X locks rec but not gap waiting


TRANSACTION 46429733, ACTIVE 20 sec starting index read

mysql tables in use 2, locked 2

4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1

MySQL thread id 1043927, OS thread handle 0x7f3e95dfd700, query id 29356872 updating

update test_user_rewards set reward_value = 0 where id = 34578

*** (2) HOLDS THE LOCK(S):

RECORD LOCKS space id 4467 page no 68318 n bits 272 index `PRIMARY` of table `test_user_cart` trx id 46429733 lock_mode X locks rec but not gap


RECORD LOCKS space id 5136 page no 1024353 n bits 472 index `contact_index` of table `hk_cat`.`user` trx id 46429733 lock_mode X waiting


According to the above log, update test_user_rewards set reward_value = 0 where id = 34578 and UPDATE test_user_cart SET value_addition = 1 WHERE id = 34699 were part of the transactions involved in the deadlock. But neither the `test_user_cart` nor the `test_user_rewards` table depends on the other through any entity relationship. How are they ending up in a deadlock?

Analysis of the MySQL and Application Logs.

Note: In Spring Boot, while executing a @Transactional method, all the SQL statements executed in the method share the same database transaction.

Taking the above statement into consideration and observing both log files and the entities in our database, it seems that the queries mentioned in the MySQL error log are not the sole cause of the problem. Instead, they are part of transaction blocks that are contributing to the issue.

We tried to dig further using APM tools; however, we realized that the APM tools we evaluated fail to link database errors to their underlying APIs, so they were of no use here.

Since we already knew which API ended with the “Deadlock found when trying to get lock; try restarting transaction” error, we knew one of the transaction blocks of the deadlock and the queries executed within that block.

The real challenge was identifying the successfully executed transaction block or API that contributed to the deadlock.

How We Overcame the Nightmare: Identifying the Root Cause of Deadlocks and LockWait Timeouts

In any relational database, each transaction block is assigned a unique transaction id. Our MySQL error log records the transaction ID of both the successful and the rolled-back transactions.
Our idea was to include this MySQL transaction ID in our application log, so that we could pinpoint which successful API call executed the transaction, and which queries executed within it caused the other transaction to roll back.

Note: Although we use Spring Boot for transaction management, it’s worth noting that the transaction ID obtained via the TransactionSynchronizationManager class in Spring Boot is not the same as the MySQL transaction ID found in our MySQL error log.

In MySQL, we can obtain the transaction information including transaction id, rows locked, and rows modified for the current transaction using the below query

SELECT tx.trx_id, trx_started, trx_rows_locked, trx_rows_modified FROM information_schema.innodb_trx tx WHERE tx.trx_mysql_thread_id = connection_id();

Implementation: Logging transaction id against each Transaction block in our Application Log

We incorporated the above query into the currently running transactions in our Spring Boot application by utilizing the Spring AOP (Aspect-Oriented Programming) concept.

Our implementation uses a POJO called TransactionMonitor to store transaction information in the thread-local of each incoming API request.

This TransactionMonitor POJO contains several fields to keep track of important transaction details such as the transaction ID, transaction name, parent method name, start time (in milliseconds), time taken for completion (in seconds), rows locked, and rows modified.

We created a pointcut on the Spring @Transactional annotation with the following code:

@Pointcut("@annotation(transactional)")
public void transactionalMethods(Transactional transactional) {}

This pointcut will match any method annotated with @Transactional.

Next, we registered a @Before advice for the above join point. Inside it, we used the TransactionSynchronizationManager to register a synchronization implementing the beforeCompletion() and afterCompletion() callbacks of the currently running transaction.

public void profile(JoinPoint joinPoint, Transactional transactional) throws Throwable {

        Propagation level = transactional.propagation();
        String methodName = joinPoint.getSignature().getName();

        if (TransactionSynchronizationManager.isSynchronizationActive()) {
            TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {

                @Override
                public void beforeCompletion() {
                    try {
                        executeBeforeTransactionMonitorV1(level, methodName);
                    } catch (Exception e) {
                        log.error("Exception occurred while executing before completion {}", e);
                    }
                }

                @Override
                public void afterCompletion(int status) {
                    try {
                        executeAfterTransactionMonitorV1(level, methodName);
                    } catch (Exception e) {
                        log.error("Exception occurred while executing after completion {}", e);
                    }
                }
            });
        }
    }
It is worth noting how transaction propagation works in Spring. By default, the propagation level is set to REQUIRED. This means that Spring checks if there is an active transaction and if none exists, it creates a new one. Otherwise, the business logic appends to the currently active transaction.

It is important to handle cases where the parent method and some branching child methods both carry the @Transactional annotation. In such cases the transaction is the same, and handling this properly avoids triggering unnecessary MySQL queries.

private void executeBeforeTransactionMonitorV1(Propagation level, String methodName) {

        String trxName = TransactionSynchronizationManager.getCurrentTransactionName();
        boolean readOnly = TransactionSynchronizationManager.isCurrentTransactionReadOnly();
        boolean executeQuery = false;

        // Query only for writable transactions, and only when this is a new transaction
        // or the same parent transactional method running again in a fresh transaction.
        if (!readOnly && (HkThreadLocal.getTrxDetails(trxName) == null || HkThreadLocal.getTrxDetails(trxName).getParentMethodName().equals(methodName))) {
            executeQuery = true;
        }

        if (executeQuery) {
            TransactionMonitor res = getTransactionDetailsFromDb();
            if (res != null) {
                HkThreadLocal.setTrxDetails(trxName, res);
            }
        }
    }

  • The purpose of this method is to check whether the current transaction has the same name and method as a previous transaction.
  • If the transaction name is the same but the method name is different, it signifies that one parent transaction executes many transactional methods.
    • In this case, the child transactional methods will not execute the query and will not reset the start time as the transaction is started when the parent transaction is started and completed only when that parent transaction is completed.
  • On the other hand, if the transaction name and method name are the same, the same transactional method is running in new transactions multiple times for a particular API.
    • In this case, the method will reset the start time every time and execute the query.
private void executeAfterTransactionMonitorV1(Propagation level, String methodName) {

        String trxName = TransactionSynchronizationManager.getCurrentTransactionName();
        if (HkThreadLocal.getTrxDetails(trxName) == null) {
            log.info("Information of transaction for transaction name " + trxName + " and method name " + methodName + " doesn't exist in thread-local");
            return;
        }

        TransactionMonitor transactionMonitor = HkThreadLocal.getTrxDetails(trxName);

        // Divide by 1000.0 so fractional seconds are not truncated by integer division
        transactionMonitor.setDiff((System.currentTimeMillis() - transactionMonitor.getStartTime()) / 1000.0);
        HkThreadLocal.setTrxDetails(trxName, transactionMonitor);

        TransactionMonitor monitor = HkThreadLocal.getTrxDetails(trxName);
        if (monitor != null && monitor.getRowsLocked() > 0) {
            log.info("Transaction Monitoring Log : " + gson.toJson(monitor.getParameters(methodName)));
        }
    }


The method calculates the total time the transaction takes and then checks if the number of rows locked during the transaction is greater than 0. If it is, it logs the transaction monitoring information.

Finally, the below method retrieves data from a MySQL database.

private TransactionMonitor getTransactionDetailsFromDb() {

        String sql = "SELECT tx.trx_id, trx_started, trx_rows_locked, trx_rows_modified " +
                "FROM information_schema.innodb_trx tx WHERE tx.trx_mysql_thread_id = connection_id()";

        List<Object[]> res = entityManager.createNativeQuery(sql).getResultList();

        if (res != null && !res.isEmpty()) {
            Object[] result = res.get(0);
            if (result != null && result.length > 0) {
                TransactionMonitor trxMonitor = new TransactionMonitor();
                // The MySQL transaction id is what we surface in the application log
                trxMonitor.setTrxId(result[0] != null ? result[0].toString() : null);
                trxMonitor.setStartTime(result[1] != null ? ((Timestamp) result[1]).getTime() : 0L);
                trxMonitor.setRowsLocked(result[2] != null ? ((BigInteger) result[2]).intValue() : 0);
                trxMonitor.setRowsModified(result[3] != null ? ((BigInteger) result[3]).intValue() : 0);
                return trxMonitor;
            }
        }
        return null;
    }

Note: Executing the query in the above approach requires the user to have the PROCESS privilege of MySQL.

Result Analysis

We already had the below logs to identify the API that failed due to the error:

In the case of Deadlock
2023-05-03 15:05:36.463 WARN [,fd583b14f8d89413,fd583b14f8d89413] 51565 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper: SQL Error: 1213, SQLState: 40001

2023-05-03 15:05:36.464 ERROR [,fd583b14f8d89413,fd583b14f8d89413] 51565 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper: Deadlock found when trying to get lock; try restarting transaction

In the case of LockWait timeout
2023-05-04 12:27:56.805  WARN [,8d2ce62e69e21438,8d2ce62e69e21438] 114297 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 1205, SQLState: 40001

2023-05-04 12:27:56.805 ERROR [,8d2ce62e69e21438,8d2ce62e69e21438] 114297 --- [nio-8080-exec-4] o.h.engine.jdbc.spi.SqlExceptionHelper   : Lock wait timeout exceeded; try restarting transaction

Finally, our application starts generating the following log upon completion of each successful transaction:

2023-05-03 15:05:36.671 INFO [,3a6df4df0f3b46bb,3a6df4df0f3b46bb] 51565 --- [nio-8080-exec-1] com.example.demo.aop.DemoAspect: Transaction Monitoring Log: {"rowsModified":3,"currentMethodName":"deadlockExample","totalTime":82.0,"rowsLocked":15,"trxId":"46429732","trxName":"com.example.demo.service.TransactionErrorServiceImpl.deadlockExample"}

We can easily extract the transaction id (to detect deadlocks) and the total time taken (to detect which transaction exceeded the configured lock wait timeout).
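Correlating then becomes a matter of searching the application log for the transaction id seen in the MySQL error log. Below is a minimal sketch of extracting the fields from the monitoring line; the field names follow the sample log above, while the regex and class name are illustrative, not part of our production code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrxLogParser {
    // Pull a single field (e.g. trxId or totalTime) out of a
    // "Transaction Monitoring Log" line; values may or may not be quoted.
    static String extract(String logLine, String field) {
        Matcher m = Pattern.compile("\"" + field + "\":\"?([^,\"}]+)\"?").matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "Transaction Monitoring Log: {\"rowsModified\":3," +
                "\"currentMethodName\":\"deadlockExample\",\"totalTime\":82.0," +
                "\"rowsLocked\":15,\"trxId\":\"46429732\"," +
                "\"trxName\":\"com.example.demo.service.TransactionErrorServiceImpl.deadlockExample\"}";
        System.out.println(extract(line, "trxId"));     // 46429732
        System.out.println(extract(line, "totalTime")); // 82.0
    }
}
```

Feeding these lines into a log aggregator (or even a simple grep) lets you match the transaction id from the MySQL error log, 46429732 in the example above, to the API that executed it.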

Now we can identify the API that caused the deadlock or Lockwait Timeout and look for a solution to resolve the issue.


A deadlock occurs when two or more transactions are waiting for each other to release locks on resources, resulting in a situation where none of the transactions can proceed. Here's an example:

Suppose we have two transactions, T1 and T2, accessing the same database. T1 wants to update row 1 and then row 2, while T2 wants to update row 2 and then row 1. If T1 locks row 1 and then tries to lock row 2, and at the same time T2 locks row 2 and then tries to lock row 1, a deadlock will occur. Neither transaction can proceed because they are both waiting for the other to release the lock on the resource they need.

Here’s a simplified version of the SQL code for T1 and T2:

-- T1
UPDATE table SET column1=value1 WHERE id=1;
UPDATE table SET column2=value2 WHERE id=2;

-- T2
UPDATE table SET column2=value2 WHERE id=2;
UPDATE table SET column1=value1 WHERE id=1;

Lock wait timeouts refer to a situation where a transaction in a database is blocked from proceeding because it is waiting for a lock on a resource that is currently held by another transaction. When this happens, the blocked transaction will wait for a certain period of time for the lock to be released by the other transaction, before timing out and throwing a lock wait timeout error.

In some cases, the transaction holding the lock may be waiting for another resource, which in turn is held by a different transaction, creating a chain of dependencies that can lead to longer wait times and potential deadlock situations.

Lock wait timeouts can be a symptom of larger performance issues in a database system, and can lead to slow query response times, reduced throughput, and application errors. 

Reference: Spring Transaction propagation

The above outcome is based on our work at HealthKart; your experience may vary, and we would love to hear about it in the comment section.

Photo by Kevin Ku: https://www.pexels.com/photo/data-codes-through-eyeglasses-577585/

How We Improved our App Startup and Navigation Time on Android App

Our engineering team at HealthKart has a keen focus on improving performance and making the system scalable for a better user experience. Our mobile development team encountered a couple of bottlenecks which were hurting performance and hence producing a bad user experience score on different performance metrics. A few of the important metrics that reflected the bad user experience were, for example, TTID, slow start over time, hot start over time, and activity navigation time.

Let's get started with understanding these metrics and how we improved them for a better user experience.

Bottlenecks for Performance – TTID is the Core Metric

The Time to Initial Display (TTID) metric refers to the time it takes for an Android application to display its first frame to the user. This metric includes several factors, such as process initialization, activity creation, and loading of necessary resources, and it can vary depending on whether the application is starting from a cold or warm state.

If the application is starting from a completely closed state, meaning it’s a cold start-up, the TTID metric will include the time it takes for the system to initialize the application’s processes and load the necessary resources before displaying the first frame. This initial startup time can take longer than a warm start-up as the app has to load everything from scratch. In our case, the startup time was observed to be 2.57 seconds, which likely includes the time it takes to complete a cold start-up.

If the application is already running in the background or has been temporarily closed, meaning it’s a warm start-up, the TTID metric will still include the time it takes to create the activity and display the first frame, but some of the necessary resources may already be loaded in the device’s memory. Therefore, warm start-up time is generally faster than cold start-up time but still contributes to the overall TTID metric.

Android Profiler – the Profiling Tool That Tells You Where You Stand

Android Profiler: This is a tool built into Android Studio that provides real-time data on app performance, including start-up times. You can use it to profile your app on a device or emulator, and it will give you detailed information on the start-up process, including the TTID metric mentioned earlier. To access the profiler, go to the “View” menu in Android Studio and select “Profiler”. The tool shows real-time graphs of the app's memory use and allows us to capture a heap dump, force garbage collections, and track memory allocations.

After observing the code blocks, we worked to remove cases of memory leaks. Our team also worked to improve the view rendering time of every module/screen in the application. We first analyzed the time taken by each view to be drawn using the Profile GPU Rendering tool. This tool displays a scrolling histogram, which visually represents how much time it takes to render the frames of a UI window relative to a benchmark of 16ms per frame.

Reducing Android App Start-up Time with Baseline Profiling, Microbenchmarking, and App Startup Library

We integrated baseline profiling and microbenchmarking into our application to reduce this time. Baseline profiling improves code execution speed by around 30% from the first launch by avoiding interpretation and just-in-time (JIT) compilation steps for included code paths.

To generate and install a baseline profile, you must use at least the minimally supported versions of the Android Gradle Plugin, Macrobenchmark library, and Profile Installer. The baseline profile generates human-readable profile rules for the app and is compiled into binary form in the app (they can be found at assets/dexopt/baseline.prof).

We also used the App Startup library, which provides a performant way to initialize components at application startup instead of doing it manually and blocking the main thread.

By taking advantage of the above measures, we achieved the following improvements:

App startup speed improved by 41%.

Slow warm start over time improved by 50%.

Slow hot start over time improved by 30%.

Migrating to Android Jetpack Compose – For Smoother Navigation between Activities

We migrated our application from imperative to declarative development by rebuilding it with Android Jetpack Compose, which uses the concept of recomposition. This also removes boilerplate code, makes debugging and testing easier, and results in smoother navigation inside the application. See below the activity navigation times that were reduced after migrating to the Compose framework.

These steps also helped increase the number of crash-free users of our application, adding to the performance improvement.

Here are some links to help you migrate from XML to Android Compose.




The above outcome is a result of our experience and might differ on a case-to-case basis. We would love to hear about your experience and any suggestions or feedback on the above.

Photo by Sajad Nori on Unsplash

Moving to SSR and Managing Google Core Web Vitals

As a company we are always focussed on the performance of our core web sites. However, with Google announcing that Core Web Vitals will be used as signals for SEO indexing on mobile, performance became the number one priority. Having figured out that a React client-side app might not be the best technology to achieve the Core Web Vitals in our case, we decided to jump onto the Server Side Rendering bandwagon. This blog is about how we migrated our React client-side app to a Next.js based Server Side Rendered app. Our partner Epsilon Delta helped and guided us to achieve the performance parameters below.

Measurement Standards and Tools –

We used two kinds of measurement during the duration of our engagement.

WebPageTest by Catchpoint – For synthetic measurement we used WebPageTest, as it provides a paid API interface to measure the performance of pages synthetically from a specific browser, location, and connection. You can store the data from each run in your own db and build a frontend to see the reports. While the advantage of WebPageTest is that it provides a visually rich screen with all relevant performance parameters and a waterfall for each run, it cannot replace what Google is going to see for real users across all variations of network, browser, location, etc. It is also difficult to capture the performance experience of logged-in users through WebPageTest, as that requires a lot of scripting.

Gemini by Epsilon Delta – For real user measurement, we could not rely on just the Search Console, Lighthouse, or PageSpeed Insights, as the real user data (the field data) is primarily fetched from the Chrome User Experience database (CrUX db). The result set is generated from the 75th percentile of the past 28 days of data, so instantaneous performance results are not available in the CrUX db once you push a performance optimization to production; it can take 28 to 56 days to know whether the change helped achieve the goal. In order to get real-time, real-user Core Web Vitals, we decided to use the Gemini RUM tool by Epsilon Delta.
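For intuition on what "the 75th percentile of the past 28 days" means: the reported LCP is the value below which 75% of sampled page loads fall. A toy nearest-rank computation (the sample LCP values are invented for illustration):

```java
import java.util.Arrays;

public class P75 {
    // Nearest-rank 75th percentile over sampled LCP values (milliseconds).
    static double p75(double[] samples) {
        double[] s = samples.clone();
        Arrays.sort(s);
        int rank = (int) Math.ceil(0.75 * s.length); // nearest-rank method
        return s[rank - 1];
    }

    public static void main(String[] args) {
        double[] lcpMs = {1800, 2100, 2300, 2500, 2700, 3200, 4100, 5300};
        System.out.println(p75(lcpMs)); // 3200.0
    }
}
```

Because the score is a high percentile over a 28-day window, one bad tail of slow loads can dominate the reported number long after a fix ships, which is why a real-time RUM view is so valuable.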

Another advantage of Gemini is that it provides the data aggregated on page templates, url patterns and platform automatically. So we were able to identify the top page templates which need to be fixed on priority.

Key Performance Issues Encountered

Before November 2020, Google was focusing on First Contentful Paint (FCP) as the most important performance parameter. However, this changed when they announced the Core Web Vitals concept, i.e. First Input Delay (FID), Largest Contentful Paint (LCP), and Cumulative Layout Shift (CLS).

FID approximately represents the interactivity of the web app.

LCP approximately represents the paint sequence of the web app.

CLS represents layout shifts, i.e. movements of elements within the visible area of the page.

While our site had FCP in the green zone, LCP and CLS were the newer performance parameters. LCP for our site was around 5.3 seconds and CLS was 0.14 for the mobile experience.

The React client-side app codebase at the time had many inefficiencies, which resulted in a large number of JS and CSS files; the overall resource request count on each page was in excess of 350 requests. There were many other issues with the code, where the configuration and structure had drifted far from the out-of-the-box setup, and a large number of customizations had made the codebase very complex. Fixing the existing issues and then reaching the desired performance level did not seem prudent, so we decided to try Server Side Rendering.

Why SSR (Server Side Rendering) ?

A detailed analysis of the performance numbers at the time gave us the following observations.

Unplanned Layout Shifts – Layout shifts were happening because client-side JavaScript was adding or removing page structures based on data coming from APIs. To avoid the shifts, we would have needed to add a lot of placeholders, and to add those placeholders and attributes in the HTML, we needed the above-the-fold HTML structure served from the server in the base HTML first. This is not the case for most client-side rendered architectures.

Delayed LCP – The Largest Contentful Paint was degraded because the LCP image was not loading early in the waterfall. Even if we had preloaded the LCP image, its base container was not present in the HTML, so LCP performance would never have reached our desired level. To solve this, we needed the above-the-fold HTML served from the server side, and hence we started looking at server-side rendering technologies.

Why NextJS ?

By this time, NextJS was already popular, some other sites had already been built with it, and quite a number of articles were available to assist in building a site on NextJS. The following features drove our decision to move to NextJS.

Launch of NextJS 11

The biggest reason we moved to NextJS was the release of version 11, which provided the ability to handle CLS in server-side rendered code. A demo and migration tool were available to migrate React client-side app code to NextJS server-side code. Of course, that migration works only under certain conditions, but luckily for us, our React client-side code fulfilled all the requirements.

This version also improved performance over version 10, with features such as script optimization, image improvements to reduce CLS, and image placeholders, to name a few. More details are available at [2].


Migration Planning and considerations

In order to do a clean migration, you need to consider the following tasks before you begin.

Page Level Migration –

NextJS handles pages based on URL patterns, so we decided to migrate our three highest-trafficked templates sequentially: the Product Details Page, the Product Listing Page and the Homepage. Each of these pages was served through a URL pattern.

Service Worker –

In any domain, there can be only one service worker. Since our React client-side app already had a service worker, we needed to ensure the same service worker was copied to the NextJS code on production deployment. We planned static resource caching based on the traffic share served from the two codebases, i.e. React client side and NextJS server side. Initially the service worker was served from the React client-side code; once two pages were migrated to the server side, we moved it to the NextJS codebase.

Load Balancer –

On the load balancer as well, one needs to ensure that proper routing happens for the URL patterns being migrated. Fortunately, modern load balancers provide plenty of options to handle such cases.

Soft to Hard Routing –

The basis of the PWA was to provide a soft-routed experience for pages after the landing page. With two codebases, however, the internal routing would have caused issues. One needs to disable page-level soft routing for any page being moved to the new codebase; this ensures the request always reaches the load balancer for effective routing. Once the migration is complete for a page, you can switch its routing back to soft.

SEO Meta data – 

As in any new development, one needs to ensure that all the SEO metadata present on the existing page is also present on the new one; a quick check can be done by running a Google SEO URL inspection.

Custom Server – 

A custom Next.js server allows you to handle specific URLs and bot traffic. We wanted to redirect some URLs to a new path and redirect bot traffic to our prerender system. Next.js has a redirects function which can be configured in next.config.js, but as our list of URLs is dynamically updated via database/cache, we added the logic in the custom server. In future, with Next.js 12, we will move this logic to middleware.

API changes –

During the migration we ensured that the existing APIs were reused to the maximum extent in the new NextJS code. Fortunately, our APIs were already designed so that the output was separated into above-the-fold and below-the-fold data. We created new APIs only in cases where we wanted to limit the data coming through and no configuration option was possible.

JS Migration – 

We started by creating a routing structure using folders and file names inside the pages directory. For example, for the product details page, we used a catch-all dynamic route by adding three dots (…) inside the brackets of the file name.

We copied the React component JS files related to a particular page from the existing codebase to the src folder in the NextJS codebase. We identified browser-level function calls and made changes to make them compatible with SSR.

Further, we divided our components by usage into above-the-fold and below-the-fold areas. Using dynamic imports with the ssr: false option, we lazy-loaded the below-the-fold components. Common HTML and some third-party library code were added in _document.js.

CSS Migration –

We imported all common CSS files used across the site, such as header.css, footer.css and third-party CSS, inside the pages/_app.js file. Component-level styles live in Component.module.css files, which are imported inside the corresponding component's JS file.

Performance Optimization Considerations

Performance is a very broad term in the technical community, used for various aspects: backend performance, database performance, CPU and memory performance, JavaScript performance, and so on. Given that Google is heavily investing in and promoting the concept of Core Web Vitals, we decided to focus our performance goal on Core Web Vitals.

We also firmly believe that even though FCP is no longer one of the Core Web Vitals, it is equally important for perceived user experience: after the URL is entered in the browser, the longer a blank white screen shows, the higher the chance the user bounces. While we had already achieved a certain FCP number, we wanted to ensure FCP did not degrade much on account of server-side rendering.

Server Side vs Client Side –

It was therefore important to restrict the HTML size, so that HTML generation time on the server stays low and FCP is maintained. We painstakingly went through each major page template and identified which components were fit for server-side rendering and which for client-side rendering. This reduced the HTML size considerably.

Preconnect calls – 

Preconnect is a directive that tells the browser to initiate the DNS lookup, TCP connection and TLS handshake for a domain ahead of time. This saves setup time before the actual resource fetch.

<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin="">

Preload LCP image –

LCP being the most important parameter in Core Web Vitals, it becomes very important to identify the LCP element at page design time and determine whether it is an image or text. In our case, it was an image that was causing the LCP. We ensured the image is preloaded using the preload directive.

<link rel="preload" as="image" href="https://img7.hkrtcdn.com/16668/bnr_1666786_o.jpg">

Making CSS Asynchronous – 

CSS is a render-blocking resource. We wanted to inline the critical CSS used above the fold and make all CSS tags async to improve FCP and eventually LCP. We used the critters npm module, which extracts, minifies and inlines above-the-fold CSS and makes CSS link tags async in server-side rendered HTML. During implementation, we found an issue with the plugin when using the assetPrefix config for a CDN path: the plugin only used base-domain URLs for static resources, and the CDN URL option was missing or not working. We raised the issue with the NextJS team, but no fix was available at the time, so we added a patch in our code to include CDN URLs for static resources. The issue has since been fixed by the NextJS team and the fix is available in the latest stable version.

Reducing the Resource Count –

In the NextJS project, we moved from component-based chunks to splitting the JS and CSS files into global code, where common functionality needed above the fold was combined, and local code, where non-common functionality was combined. We also ensured that above-the-fold JavaScript was rendered server side, with below-the-fold rendered on the client side. This reduced the resource count on the Product Listing Page from 333 to 249 and on the Product Details Page from 758 to 245.

Javascript size reduction in Nextjs –

We did a thorough analysis of the JavaScript used on the server and client sides in order to identify unused scripts, analyze which components and libraries are part of each bundle, and check whether any third-party library showed up unexpectedly in a bundle. There were a few possible areas of improvement, e.g. encryption algorithms used in the JavaScript code, if any. We ensured that no third-party JS was integrated into the server-side JS except when absolutely necessary, and then only as inline code.

We used the Next.js Webpack Bundle Analyzer to analyze the generated code bundles. The tool provides both server-side and client-side reports as HTML files, which helps inspect what is taking the most space in the bundles.

Observations –

On completion of the project, we can identify a few key points related to our goal of improving Core Web Vitals.


Performance Optimization –

We were able to achieve close to a 50% improvement in real user LCP, bringing it very close to 2.5 seconds. This was possible because server-side rendering put the very element causing the LCP into the HTML.

Layout shifts on the client side also reduced to a great extent, as we were stitching the above-the-fold area of the page on the server side; thus no major CLS movements were happening in the client-rendered area.

Greater control on served experience –

Because of the server-side rendered experience, we could decide at HTML generation time exactly which experience to serve. We had better control over which elements to show and which not to show. Most importantly, this also helped with our category-based views and user-type-based views, i.e. admin, premium vs normal user, etc. In the client-side environment, this very problem caused a lot of complexity in our earlier React-based code.


Higher backend latency – 

One of the tradeoffs of moving to server-side rendering is higher backend latency. If not managed correctly, it can degrade your FCP and thereby also the LCP. One needs to carefully consider the structure of the page to keep the increase minimal. We did this by moving below-the-fold functionality to client-side rendering, which reduced HTML generation time. We also created or modified existing APIs to return precise JSON for backend integration; better API response times helped reduce the backend latency. Finally, we reduced the state variable size: in NextJS, the state is serialized into the HTML, so larger state variables cause the HTML size to increase.

Lesser support from Tech Community –

When we started the project, NextJS was a relatively new technology with less documentation available. Only limited technical issues were discussed in forums, and the community had not yet brainstormed possible solutions for many of them. However, this is no longer the case, and NextJS is now a widely accepted technology.

Infrastructure Investment –

Another area that needs attention is sizing the backend infrastructure, and possibly the infrastructure on the API side. By design, server-side rendering moves processing to the servers, so one naturally needs more server capacity to serve the frontend code compared to a React client-side app. We believe this is a small price to pay to improve Core Web Vitals.


With the NextJS migration, we were able to achieve our Core Web Vitals performance goal within the planned time, with no major slippages. The result graphs are as follows.

Mobile Homepage –

Mobile L1 Category Listing –

Mobile L2 Category Listing –

Mobile Product Details Page –

LCP 75th Overall in Gemini

CLS 75th Overall in Gemini

LCP PDP Mobile 75th in Gemini

CLS PDP Mobile 75th in Gemini

LCP Category Listing Page 75th Gemini

CLS 75th Category Listing Page Gemini

Homepage 75th Gemini

CLS Homepage 75th Gemini

The above content is an outcome of our experience working with the above problem statement. Please feel free to reach out and comment with any feedback or suggestions.

Using EDA and K-Means for food similarity and diet chart

Health and wellness is a complex thing and needs a holistic lifestyle approach to achieve and maintain. Food is an important pillar of your health and fitness. Complexity increases as we move into the details of food items to classify what is healthy and what is unhealthy, especially when considering the health and fitness goal of an individual. Food suggested for one fitness goal might not fit another: high-carb food might be preferred when weight gain is the goal, but the opposite may hold for weight loss.

There are millions of food items in the world, and our task was to classify and suggest food items based on the user's health and fitness objective. The algorithm should also be able to recommend healthy alternatives to the items a user eats in their daily routine.

The problem statement we had was to prepare a diet chart for users based on their goals. Every goal has its own calorie requirement and percentages of primary nutrients, i.e. carbohydrate, fat, protein and fibre. It made a lot of sense in this context to group foods based on these properties, classifying them as high-carb, high-protein or high-fat items. Hence we decided to analyse the data and create clusters out of it.

We divided our process into the following steps:

  1. Reading, Understanding, and visualising data.
  2. Preparing data for modelling.
  3. Creating Model.
  4. Verifying accuracy of our model.

Let's get started by reading and understanding the data.

In total, we were provided with 1940 records having 88 attributes, out of which, according to our business requirement, we needed attributes like foodName, carbs, protein, fat, fibre, weight, calorie, saturatedFat and volume.

Several entries in our dataset had missing values; there can be two reasons for this.

  1. The value was intentionally left out because the food item does not contain that nutrient; the missing value simply represents zero.
  2. There was an error collecting the data, and those values were skipped during data entry.

After consulting the data source, we imputed missing values with zero.
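As a small sketch of this imputation step (hypothetical rows; the column names follow the attributes listed above):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: per the data source, a missing nutrient means "not present"
df = pd.DataFrame({
    "foodName": ["egg white", "butter"],
    "fat": [np.nan, 81.0],
    "fibre": [np.nan, np.nan],
})

# Impute missing nutrient values with zero
df[["fat", "fibre"]] = df[["fat", "fibre"]].fillna(0)
print(df["fat"].tolist())  # [0.0, 81.0]
```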

Next, the calorie value of a food item includes calories from all the mineral and nutrient components in the food, but since we are concerned with only a few of those nutrients, we calculate calories using those alone. According to the standard formula, the calories come out as:

calorie = 4*carbs + 9*fat + 4*protein

Hence, we came up with a derived metric, calorie_calculated, using the above formula.
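A minimal sketch of the derived metric, assuming a pandas DataFrame with the nutrient columns in grams (the rows here are hypothetical):

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 1940 records and 88 attributes
df = pd.DataFrame({
    "foodName": ["oats", "paneer"],
    "carbs": [60.0, 4.0],     # grams per serving
    "fat": [7.0, 20.0],
    "protein": [13.0, 18.0],
})

# Standard factors: 4 kcal/g for carbs and protein, 9 kcal/g for fat
df["calorie_calculated"] = 4 * df["carbs"] + 9 * df["fat"] + 4 * df["protein"]

print(df["calorie_calculated"].tolist())  # [355.0, 268.0]
```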

Standardising values:

The columns carbs, fat, protein and fibre are in grams, but for our analysis we need to convert and standardize them to their calorie representation. Since fibre contributes no calories, we convert it to content per unit weight of the food item instead.

It is very important in a clustering algorithm that our features are not correlated. But as the heatmap below shows, as the calories of a food item increase, so do fat, carbs and protein. In order to remove this correlation, we took a ratio with the calculated calorie.
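The two normalisation steps above can be sketched as follows (one hypothetical food row; calorie_calculated is the derived metric defined earlier):

```python
import pandas as pd

# One hypothetical food row, nutrient columns in grams
df = pd.DataFrame({
    "carbs": [60.0], "fat": [7.0], "protein": [13.0],
    "fibre": [10.0], "weight": [100.0],
})
df["calorie_calculated"] = 4 * df["carbs"] + 9 * df["fat"] + 4 * df["protein"]

# Express each calorie-contributing nutrient as its share of total calories,
# which removes the correlation with the overall calorie count
for col, factor in [("carbs", 4), ("fat", 9), ("protein", 4)]:
    df[col] = factor * df[col] / df["calorie_calculated"]

# Fibre contributes no calories, so normalise it per unit weight instead
df["fibre"] = df["fibre"] / df["weight"]

print(df[["carbs", "fat", "protein", "fibre"]].round(3).to_dict("records"))
```

The resulting features are dimensionless shares, so a 100 g serving and a 500 g serving of the same food land in the same region of feature space.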

Now that our data is clean and correlations are handled, let's move to the next step, i.e. clustering.

What is clustering?

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression and computer graphics.

It is a kind of unsupervised learning: we don't provide any labels, and we try to separate the data into subgroups based on the features provided.

What is K-Means Clustering ?

K-means is a centroid-based, or distance-based, algorithm, where we calculate distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid. It tries to make intra-cluster data points as similar as possible while keeping the clusters as far apart as possible.

It involves the following steps:

  1. Choose the number of clusters, say k. This is the K in K-Means clustering.
  2. Select k random points from the data as initial centroids.
  3. Measure the distance between the first point and the k centroids.
  4. Assign the first point to the nearest cluster, then repeat steps 3 and 4 for the remaining points. Once all the points are assigned to a cluster, move on to the next step.
  5. Calculate the mean of each cluster, which becomes its new centroid.
  6. Measure the distances from the new centroids and repeat steps 3 to 5. Once the clustering does not change between iterations, we are done.
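The steps above can be sketched in plain NumPy (a toy illustration of the algorithm, not the scikit-learn implementation we actually used):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs converge to two clean clusters
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, _ = kmeans(pts, k=2)
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
# True True True
```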

We can assess the quality of a clustering by adding up the variation within each cluster. Since K-Means cannot see the best clustering directly, its only option is to keep track of these clusters and their total variance, and redo the whole thing with different starting points.

Since K-Means relies heavily on distance, it is very important for our features to be scaled to zero mean and unit standard deviation. The best feature-scaling technique to use in this case is standardisation.

The next question is: what should the value of K be?

For this we use the Elbow Curve method. It gives a good idea of what K should be, based on the sum of squared distances (SSE): we pick K at the spot where the SSE starts to flatten out, forming an elbow.

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

required_cols = ['carbs', 'fat', 'fibre', 'protein']

scaler = StandardScaler()
df[required_cols] = scaler.fit_transform(df[required_cols])
df[required_cols] = df[required_cols].fillna(0)

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, max_iter=300)
    kmeans.fit(df[required_cols])
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares

plt.plot(range(1, 11), wcss)
plt.title('Elbow curve')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

We get the above curve, from which we can say that the optimal number of clusters, and hence the value of K, should be around 4.

Analysis of Clustering

We are using Silhouette Analysis to understand the performance of our clustering.

Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:

  • Compute the average distance from the sample to all other data points in its own cluster (ai).
  • Compute the average distance from the sample to all data points in the nearest other cluster (bi).
  • Compute the coefficient:

    si = (bi − ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1].

  • If it is 0 –> the sample is very close to the neighboring clusters.
  • If it is 1 –> the sample is far away from the neighboring clusters.
  • If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible and close to 1 in order to have good clusters. Let's analyse the silhouette score in our case.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

result = {}
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, max_iter=300, n_init=10)
    pred = kmeans.fit_predict(df[required_cols])
    result[i] = silhouette_score(df[required_cols], pred, metric='euclidean')

We get the result as:

{2: 0.31757107035913174, 3: 0.34337412758235525, 4: 0.3601443169380033, 5: 0.2970926954241235, 6: 0.29883645610373294, 7: 0.3075310165352718, 8: 0.313105441606524, 9: 0.2902622193837789, 10: 0.29641563619062317}

We can clearly see that k = 4 gives the highest silhouette score. Hence 4 is a good choice for the optimal value of K.

Once we had K, we ran K-Means and formed our clusters.

Next comes prediction. Say we get the nutrient composition for a specific goal. We scale that data into the format our model accepts and predict the cluster of the given composition.

import numpy as np

# food_item: the scaled nutrient composition for the target goal
y_pred = model.predict([food_item])
label_index = np.where(model.labels_ == y_pred[0])

Once we have the label_index, we filter those foods from our data and calculate the Euclidean distance of each food item from the given composition.

dist = [np.linalg.norm(df.iloc[index] - food_item) for index in label_index[0]]

This way, we obtain the food items most closely related to the provided composition, and hence we can prepare the diet the way we want. For example, if we want to further filter the clustered results by veg/non-veg type, we can apply that filtering as well.

The above content is an outcome of our experience working with the above problem statement. Please feel free to reach out and comment with any feedback or suggestions.

Photo by Lily Banse on Unsplash

How to track sleep through an Android app


Our HealthKart application helps users achieve health and fitness goals through our digital platform. Achieving those goals requires many things to be incorporated into the daily routine, and sleep is an important parameter for the same.

Sleep tracking can be done through a couple of methodologies, and one popular way is to track it with a smart band/watch. The HealthKart app integrates with various health and fitness bands to track sleep, but we wanted another, much easier way to track our users' sleep so that we could maximize the data inputs from them on this front.

These days, most people use their phone from morning to night, and pick it up first thing after waking up. So we calculate sleep time from the user's activity on the phone.

Now the question is which activities to capture for this. The answer is just two, listed below.

  • The device screen turns ON, either intentionally by the user or by another application, e.g. the phone ringing.
  • The device screen turns OFF.

Android Components used for this

  • Started Service
  • BroadcastReceiver

Steps for using Android Components

  1. Create a SleepTrackerService that extends the Service class.
class SleepTrackerService : Service() {

  override fun onBind(p0: Intent?): IBinder? {
    return null
  }

  override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
    Log.i(TAG, "Sleep Tracker Service")
    return START_STICKY
  }

  override fun onDestroy() {
    super.onDestroy()
  }
}

2. Create two BroadcastReceivers, ScreenONReceiver and ScreenOFFReceiver, to detect when the screen goes into ON mode and OFF mode. Register both receivers in the onStartCommand method of the Service class.

private var screenOffReceiver: ScreenOFFReceiver? = null
private var screenOnReceiver: ScreenONReceiver? = null

screenOffReceiver = ScreenOFFReceiver()
val offFilter = IntentFilter(Intent.ACTION_SCREEN_OFF)
registerReceiver(screenOffReceiver, offFilter)

screenOnReceiver = ScreenONReceiver()
val onFilter = IntentFilter(Intent.ACTION_SCREEN_ON)
registerReceiver(screenOnReceiver, onFilter)

3. To keep the service running in the background, including when the application is killed, we used a foreground service.

val notificationManager = getSystemService(Context.NOTIFICATION_SERVICE) as NotificationManager

val notificationIntent = Intent(this, SleepTrackerActivity::class.java)
val uniqueInt = (System.currentTimeMillis() and 0xfffffff).toInt()
val pendingIntent = PendingIntent.getActivity(
  this, uniqueInt, notificationIntent, PendingIntent.FLAG_UPDATE_CURRENT)

val builder: NotificationCompat.Builder =
  NotificationCompat.Builder(this, SLEEP_CHANNEL_ID)
builder.apply {
  setContentText("Sleep Tracking")
  priority = NotificationCompat.PRIORITY_HIGH
  addAction(R.drawable.blue_button_background, "TURN OFF", pendingIntent)
}

val notification = builder.build()
notification.flags = Notification.FLAG_ONGOING_EVENT

startForeground(SLEEP_NOTIFICATION_SHOW_ID, notification)

4. Now capture the timings in ScreenONReceiver and ScreenOFFReceiver.

inner class ScreenOFFReceiver : BroadcastReceiver() {
  override fun onReceive(context: Context, intent: Intent) {
    // Record when the screen went off (a candidate sleep-start time),
    // e.g. into a field on the service
    lastScreenOffTime = System.currentTimeMillis()
  }
}

inner class ScreenONReceiver : BroadcastReceiver() {
  override fun onReceive(context: Context, intent: Intent) {
    // Record when the screen came on (a candidate wake-up time)
    lastScreenOnTime = System.currentTimeMillis()
  }
}

5. Unregister the receivers in the onDestroy method when the service is destroyed.

override fun onDestroy() {
  screenOffReceiver?.let { unregisterReceiver(it) }
  screenOnReceiver?.let { unregisterReceiver(it) }
  with(NotificationManagerCompat.from(this)) {
    cancel(SLEEP_NOTIFICATION_SHOW_ID)
  }
  super.onDestroy()
}

We capture this screen on/off event data for the user and send it to the backend, where our algorithm calculates the sleep behind the scenes.

This methodology is much easier to implement, and the user does not need to wear a gadget all the time. Obviously there are a few trade-offs here too, but this was a balanced approach to maximize data inputs from our end users.

This tutorial is an outcome of our own experience implementing sleep tracking. Your suggestions and feedback are heartily welcome.

Photo by Lauren Kay on Unsplash

Adding In App Video Chat Support – Things to consider

The pandemic has driven an exponential rise in video communication adoption on digital platforms, enabling better personal support service to customers. If you have a digital property, in-app video chat becomes an important aspect of it. From a technology perspective, there are multiple options to choose from when implementing video chat. Lots of questions come to mind: which protocol to choose, should I use open source, should I use hosted services (CPaaS), what about pricing, and many more.

HealthKart provides nutrition services to its customers through in-app video chat support. Customers can initiate a video chat with a nutritionist, or they receive a video call in the app at the scheduled appointment time for a one-to-one consultation with the nutritionist. These kinds of implementations require a few things to be considered before we actually jump to the implementation part. Let's discuss them in detail.

Choosing the right communication protocol – Technology changes rapidly and keeps evolving every day; as a result, we keep getting new frameworks, tools and protocols at warp speed, and what worked a couple of years back might not be relevant today. RTMP (Real-Time Messaging Protocol), which used to rule streaming, was displaced by HLS (HTTP Live Streaming) from Apple and DASH (Dynamic Adaptive Streaming over HTTP) based protocols. WebRTC is the newer game-changer: it is based on P2P, with configuration support for both TCP and UDP, and is primarily designed for browser-to-browser media streaming.

Looking at WebRTC's advantages, we at HealthKart opted to go with WebRTC-based streaming frameworks for implementing in-app video chat support.

Open source vs CPaaS (Build vs Buy) – This might be a tricky call: should you build it in-house or use a hosted solution like CPaaS (Communication Platform as a Service)? If you choose to build it in-house, you have to put in a lot of effort to find the right server and client tools to make it work, and also handle its scalability and reliability. Looking at the complexity of the service and at our in-house capabilities and priorities, we at HealthKart chose not to build this in-house and instead looked at the hosted/CPaaS services readily available in the market.

This call is contextual, based on the individual needs of the organization, and may vary. If you need more information on what one should consider, please read our other blog post about the same here.

CPaaS – TokBox (now Vonage) vs Twilio vs others – If you decide to go ahead with hosted services, the next thing to decide is which one to use. There are multiple CPaaS providers in the market, and one has to evaluate them on various aspects. TokBox and Twilio lead the market, and we evaluated both on the aspects highlighted below.

  1. Ease of Use – No matter which provider you choose, you have to pick up their SDKs, read their developer docs and integrate them into your app. There is also plenty of terminology to understand: session ID, relay mode, routed mode, etc. TokBox and Twilio both have quite descriptive developer guides and easy-to-use quick-start applications. Their conceptual docs are also nicely written and easy to understand. We had a quick-start sample up and running in a web application in less than an hour. The Android and iOS SDKs need integration points and configuration and required more time on that front; however, both providers make the setup straightforward on both platforms.
  2. Pricing – Every provider has a different pricing model, and one has to understand which suits them best. TokBox starts with a flat $9.99/month subscription including 2,000 minutes, whereas Twilio charges $0.0010 per participant-minute. One should do a clear calculation based on estimated user sessions and choose accordingly. There is a quite detailed blog post on this that gives good insight into right-sizing the pricing model across various CPaaS providers.
  3. Support – Twilio and TokBox both have good support behind them. If you move to an enterprise plan, both provide dedicated support for your needs. In our experience, we reached out to their support once or twice and got fast responses on integration questions.
  4. Feature Listing – You might also need support for additional features while integrating video chat. Recording, analytics, intelligent insights using AI and text chat support are a few that may be required in some cases. Please go through each provider's listing to see what they have to offer.
  5. Extensibility – Look at the extensibility of each provider: the ecosystem they have, and how they can support you in extending the functionality or in any custom development or feature you need on top of it. In our experience, both have limited extensibility and do not offer much customization beyond the features they provide. We wanted incoming video call support (similar to WhatsApp video calling) in our app; no out-of-the-box solution was available from either, and we had to build it on our own with real-time push notification services on both Android and iOS. However, it was not really a deal-breaker for us, as the primary requirement was to get something built into the app in an agile and cost-effective way.
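To make the pricing comparison in point 2 concrete, here is a small, stdlib-only Java sketch that estimates monthly cost under both models using the list prices quoted above. The session numbers are hypothetical, and real bills depend on tiers and overage rates, so treat it purely as a back-of-the-envelope estimator:

```java
// Rough cost comparison for the two pricing models described above.
public class CpaasCostEstimator {

    // TokBox-style: flat subscription covering a bundle of minutes.
    static double flatRateCost(double monthlyFee, int includedMinutes, int usedMinutes) {
        // Overage handling is provider-specific; here we simply flag the overflow.
        if (usedMinutes > includedMinutes) {
            throw new IllegalArgumentException("Usage exceeds bundle; check overage pricing");
        }
        return monthlyFee;
    }

    // Twilio-style: pay per participant-minute.
    static double perMinuteCost(double ratePerParticipantMinute,
                                int sessions, int participantsPerSession, int minutesPerSession) {
        return ratePerParticipantMinute * sessions * participantsPerSession * minutesPerSession;
    }

    public static void main(String[] args) {
        // Hypothetical load: 200 sessions/month, 2 participants, 5 minutes each.
        double twilio = perMinuteCost(0.0010, 200, 2, 5);      // 2,000 participant-minutes
        double tokbox = flatRateCost(9.99, 2000, 200 * 2 * 5); // within the flat bundle
        System.out.printf("Twilio: $%.2f, TokBox: $%.2f%n", twilio, tokbox);
    }
}
```

At this (small) hypothetical volume the per-minute model wins; the flat bundle starts paying off as usage approaches the included minutes.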

Considering the overall picture, we decided to use TokBox for its super simple pay-as-you-go pricing and ease of use.

The above is based on our own experience and is not a promotion of any of these services. Your experience with each of them may vary. Please feel free to share your feedback and input.

Photo by visuals on Unsplash

When not to use Microservices

There is no silver bullet in this world that can solve all your problems. In the field of medicine, a given medication helps fight only certain diseases. A medication for headache cannot be used for diabetic care or for curing eye problems. Certain medications are also unsuitable if you have a preexisting condition; for example, paracetamol should not be given for fever if you already have liver dysfunction.

Technology frameworks, design patterns and architectural choices work in a similar way. A given framework solves a given set of problems; at the same time, it can act as an anti-pattern if you also have some other problem statement at hand.

OK, so let's talk about microservices a bit.

Microservices have been a technology buzzword since the start of this decade. As engineers, we are always quick to consider the buzzing frameworks around us because other tech giants use and promote them. You might hear statements from your engineering team like "why don't we use Go/Rust/Julia" or any other trending language or framework. If you ask them to explain why we should use it, chances are pretty high that you will get the same response –

“It is a new trend, people talk about it and it has been open sourced by Google/Facebook etc. My friend is also working on the same..”

We often fall into this trap and rush to adopt something before thinking much about the core of the situation –

“Will this really solve my problem? Do I really have a problem statement that this framework/language can solve?”

Microservices are no exception. They have certain pros and cons, and one should be quite sure that one actually has a problem statement that microservices can solve. One should also be very clear about the trade-offs that come with using them. If you have not introspected on this beforehand, it can become a real mess for your engineering team going forward. Yes, you heard it right…

“Microservices are a real mess… You should consider them only if you are 100% aware of their pros and cons and ready to handle the downsides.”

Evolution of Microservices

Microservices came into existence around 2010 and were rapidly adopted by many tech companies. Netflix has been a big promoter of microservices and has contributed a lot on this front. When I first came to know about microservices in 2011, the first thing that came to my mind was…

“Ohh… not much different from SOA and ESB… indeed a specialized variant of SOA that only works over HTTP and mostly supports the JSON data format. Then why so much buzz around it…”

Microservices were nothing new, especially for people working on enterprise architecture at the time; they found the approach similar to what SOA was providing on the operational front. Indeed, Adrian Cockcroft, former director of Cloud Systems at Netflix, described this approach as “fine grained SOA” back in 2012.

Let's talk about some obvious advantages of microservices.

One of the biggest reasons microservices got a big push was the adoption and standardization of the communication protocol and data format: HTTP and JSON became the standard for system communication. HTTP's footprint became so large that even the smallest IoT devices started supporting it. This evolution ultimately killed two things – SOA and M2M protocols – and gave birth to a new skill: DevOps.

So let's see what benefits microservices offer –

  • Modularity – This makes the application easier to understand, develop and test, and more resilient to architecture erosion. This benefit is often argued in comparison to the complexity of monolithic architectures.
  • Scalability – Since microservices are implemented and deployed independently of each other, i.e. they run within independent processes, they can be monitored and scaled independently.
  • Ease of integration – Microservices are considered a viable means of modernizing existing monolithic software applications. There are experience reports from several companies that have successfully replaced (parts of) their existing software with microservices, or are in the process of doing so.
  • Distributed development – Microservices parallelize development by enabling small autonomous teams to develop, deploy and scale their respective services independently. They also allow the architecture of an individual service to emerge through continuous refactoring, and they facilitate continuous integration and deployment.

Alright, this seems fair enough. Now let's see some of the complexity and downsides that microservices bring to the table. The advantages above might sound fascinating and easy to achieve; however, in technology, things that sound easy are usually hard to achieve.

What is hard to achieve with microservices –

  • Cross-service transactions – If your system requires data consistency across different microservices, you will not find it easy – I would say there is no clean way to handle it. You might be tempted to write your own transaction management, but the cost of doing so is very high: you have to handle a lot of things that a single annotation would have taken care of in a monolith.
  • Infrastructure and operational efficiency – Deployment, data backup and restore, and data recovery become really challenging and turn into overhead for your DevOps team. Since each service has its own deployment server and database, the DevOps team has to plan scaling, backup/restore and recovery strategies separately for each microservice. If you have a ton of services running in production, this can be really painful for DevOps, especially if you are a very lean engineering team.
  • Cloud cost efficiency – Cloud costs increase as you keep spawning new servers and pushing data between them. Since each microservice runs on a separate server, your infrastructure cost will rise even if you use Docker and other container orchestration software. We also tend to ignore data-in/data-out costs, but if you move large amounts of data between servers, this can raise expenditure by a significant amount. Because microservices pass a lot of data between systems for aggregation and composition of services, this takes a toll on infrastructure cost compared with a monolithic deployment.
  • Testing, deployment and debugging – Consider the case where the output given to a client is the outcome of aggregating responses from multiple microservices. To debug or test anything, one has to trace through all the microservices in production to find the cause. You have to define a logging strategy up front to avoid a mess when debugging the system. As you keep adding new services, issues become really hard to solve and may compromise the agility of your deliverables.
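To see why the first point bites, consider what replaces a single transactional method once the two updates live in different services. A common hand-rolled workaround is to pair each step with a compensating (undo) action and run the compensations in reverse order when a later step fails – a lot of machinery for what one annotation gave you in the monolith. A deliberately simplified, stdlib-only sketch (all class and method names are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In a monolith, debiting inventory and creating an order is one DB transaction.
// Across two microservices there is no shared transaction manager, so each step
// carries an undo action, and we roll back manually on failure.
public class CrossServiceUpdate {

    interface Step {
        void execute();
        void compensate();
    }

    static boolean runAll(Step... steps) {
        Deque<Step> done = new ArrayDeque<>();
        try {
            for (Step s : steps) {
                s.execute();
                done.push(s);
            }
            return true;
        } catch (RuntimeException e) {
            // Undo completed steps, newest first.
            while (!done.isEmpty()) {
                done.pop().compensate();
            }
            return false;
        }
    }
}
```

Even this toy omits the hard parts: in practice every compensation must itself be retryable and idempotent, since the undo call can also fail.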

OK, so when should we really NOT use microservices?

  • If you don't have a problem statement at hand – Looking at the pros, first check whether you really have a matching problem. There are other ways of achieving modularity and scalability than going the full microservices route. You can also adopt a lean microservices approach (breaking your monolithic system into just two or three microservices) for better modularity and scalability.
  • If you are an early-stage startup – Avoid microservices if you are just starting up and still in the validation phase. You might end up solving a problem that does not need to be solved at the current stage of your organization. Remember: you are not Netflix.
  • You have a very lean, not very distributed team – Microservices work better when you have a distributed team and each sub-team can work independently on its own microservice. If your engineering team is not in that state, avoid them, or try the lean model of microservices explained in the first point.
  • DevOps skill is missing in your team – A microservices architecture requires a lot of DevOps work for deployment and infrastructure management. Avoid it if you or your team lack this skill or do not have much experience handling a microservices setup.

Final thoughts –

We at HealthKart use the microservices architecture pattern and have gone through this evolution from monolith to microservices. The two practices below have really helped us handle the downsides of this pattern while maximizing the upside.

  • Optimal service decomposition strategy – Don't overdo it.
  • Go slow – Do it in an agile way: Develop > Measure > Learn. Start with no more than 2-3 services that are critical from a scalability and modularity point of view. This will really help you decide whether the approach will work for you.

P.S. The above content is the outcome of my experience working with microservices, and I am open to feedback and suggestions.

References – https://en.wikipedia.org/wiki/Microservices

Photo by Dimitri Houtteman on Unsplash

API Gateway- Front Controller to our Microservices

What is an API Gateway?

An API Gateway is the first step towards diving into a microservices architecture. It is a type of proxy server that sits in front of all our backend services and provides a unified interface to clients. It acts as the single entryway into a system, allowing multiple APIs or microservices to act cohesively and provide a uniform experience to the user.

An API gateway takes all API requests from clients; some it handles by simply routing them to the appropriate service, while for others it aggregates the various services required to fulfil them and returns the combined response.

Why API Gateway? What benefits does it provide?

As more and more organizations move into the world of microservices, it becomes imperative to adopt an API management solution that takes on the workload of ensuring high availability and performs certain core functions.

A major benefit of API gateways is that they allow developers to encapsulate the internal structure of an application in multiple ways, depending on the use case. Some of the core benefits an API gateway provides are:-

  1. Security policy enforcement – API gateways provide a centralized proxy server to manage rate limiting, bot detection, authentication, CORS, etc.
  2. Routing & aggregation – Routing requests to the appropriate service is the core of an API gateway. Certain API endpoints may need to join data across multiple services. API gateways can perform this aggregation so that the client doesn't need complicated call chaining, reducing the number of round trips. Such aggregation simplifies the client by moving the logic of calling multiple services from the client to the gateway layer. It also gives breathing space to backend services by lifting off them the thread-management logic for assembling responses from various services.
  3. Cross-cutting concerns – Logging, caching and other cross-cutting concerns such as analytics can be handled in one centralized place rather than being deployed to every microservice.
  4. Decoupling – If clients communicate directly with many separate services, renaming or moving those services is challenging because the client is coupled to the underlying architecture and organization. API gateways enable routing based on path, hostname, headers and other key information, helping decouple the publicly facing API endpoints from the underlying microservice architecture.
  5. Ability to configure fallbacks – In the event of failure of one or more microservices, an API gateway can be configured to serve a fallback response – from a cache, from another service, or as a static response.
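The routing & aggregation benefit above can be sketched with plain java.util.concurrent: the gateway fans out to two backend services in parallel and merges the results, so the client makes a single round trip. The service stubs and payloads here are made up for illustration – in a real gateway they would be HTTP calls:

```java
import java.util.concurrent.CompletableFuture;

// A gateway endpoint that joins data from two backend services.
public class AggregatingGateway {

    static String fetchProduct(String id) { return "{\"id\":\"" + id + "\",\"name\":\"Whey\"}"; }
    static String fetchReviews(String id) { return "{\"count\":42}"; }

    static String productPage(String id) {
        // Fan out to both services concurrently instead of chaining calls.
        CompletableFuture<String> product = CompletableFuture.supplyAsync(() -> fetchProduct(id));
        CompletableFuture<String> reviews = CompletableFuture.supplyAsync(() -> fetchReviews(id));
        // Merge both responses once they complete; the client sees one payload.
        return product.thenCombine(reviews,
                (p, r) -> "{\"product\":" + p + ",\"reviews\":" + r + "}").join();
    }
}
```

This is exactly the thread-management logic that point 2 describes lifting off the backend services (and off the client) into the gateway layer.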

Solutions Available?

There are a myriad of solutions available when it comes to choosing an API gateway. A few renowned ones include –

  • Amazon API Gateway
  • Azure API Management
  • Apigee
  • Kong
  • Netflix Zuul
  • Express API Gateway

In my view, the primary factors to consider while choosing a suitable API gateway are the following:-

  1. Deployment complexity – how easy or difficult it is to deploy and maintain the gateway service itself.
  2. Open source vs proprietary – are extension plugins readily available? Does the free tier scale to your required traffic?
  3. On-premise vs cloud-hosted – on-premise deployments take additional time to plan and maintain, while cloud-hosted solutions can add a bit of latency due to the extra hop and can reduce the availability of your service if the vendor goes down.
  4. Community support – is there a considerable community using/following the solution where problems can be discussed?

How did HK leverage API gateway?

At HealthKart we chose the Netflix Zuul API gateway (edge service) as the front door for our microservices. We have embedded our authentication & security validation at the gateway layer to avoid replicating it in multiple services, and we use it to dynamically route requests to different backend clusters as needed.

We have also implemented routing rules and filters. Say we want to append a special tag to the request header before it reaches the internal microservices – we can do that at this layer.

Netflix Zuul – What & How?

At a high level view, Zuul 2.0 is a Netty server that runs pre-filters (inbound filters), then proxies the request using a Netty client and then returns the response after running post-filters (outbound filters). The filters are where the core of the business logic happens for Zuul. They have the power to do a very large range of actions and can run at different parts of the request-response lifecycle.
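The pre-filter → proxy → post-filter flow can be modeled in a few lines of plain Java. This is a toy pipeline, not the actual Zuul API, and the header name is hypothetical; it only illustrates where each filter type runs in the request-response lifecycle:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Toy model of Zuul's lifecycle: inbound filters mutate the request,
// the "proxy" step produces a response, outbound filters mutate the response.
public class FilterChainSketch {

    static String handle(Map<String, String> requestHeaders,
                         Consumer<Map<String, String>> preFilter,
                         Function<Map<String, String>, String> proxy,
                         UnaryOperator<String> postFilter) {
        preFilter.accept(requestHeaders);              // e.g. tag the request
        String response = proxy.apply(requestHeaders); // forward to the backend
        return postFilter.apply(response);             // e.g. stamp the response
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        String out = handle(headers,
                h -> h.put("X-HK-Source", "gateway"),       // pre-filter
                h -> "backend saw " + h.get("X-HK-Source"), // proxy stub
                body -> body + " [filtered]");              // post-filter
        System.out.println(out);
    }
}
```

The header-tagging use case mentioned earlier maps onto the pre-filter slot; in real Zuul the same mutation happens inside a filter's run method.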

Zuul works in conjunction with the Netflix Eureka service. Eureka is a REST-based service, primarily used in the AWS cloud, for locating services for load balancing and failover of middle-tier servers. Zuul does not generally maintain hard-coded network locations (host names and port numbers) of backend microservices; instead, it interacts with a service registry and dynamically obtains the target network locations.
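Conceptually, a registry like Eureka maps a logical service name to a live set of instances, and the gateway resolves the name at request time. A minimal in-memory sketch of that lookup with round-robin selection (service names and addresses are illustrative; real Eureka adds heartbeats, leases and registry replication on top):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stand-in for a Eureka-style registry: services register their
// instances, and the gateway resolves a logical name to one live instance.
public class ServiceRegistrySketch {
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger();

    void register(String service, String hostPort) {
        instances.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(hostPort);
    }

    // Round-robin over registered instances, as a client-side load balancer would.
    String resolve(String service) {
        List<String> list = instances.get(service);
        if (list == null || list.isEmpty()) {
            throw new IllegalStateException("no instances for " + service);
        }
        return list.get(Math.floorMod(counter.getAndIncrement(), list.size()));
    }
}
```

This is why renaming or rescaling a backend never touches the gateway config: only the registry's contents change.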

To get this working on our edge microservice, Spring Boot provides excellent built-in support; we just had to enable a few configurations. A code snippet for the same is illustrated below:-

@SpringBootApplication
@EnableZuulProxy // enables the Zuul reverse proxy on this Spring Boot app
public class GatewayServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(GatewayServiceApplication.class, args);
    }
}

At the respective microservice layer, we needed to integrate service discovery so that as soon as a microservice comes up, it registers itself with the Eureka server registry. The @EnableDiscoveryClient annotation in Spring Boot helps us achieve this.

The following properties on the client side enabled the client registry:-

eureka.instance.hostname= xxx
eureka.client.region= default
eureka.client.registryFetchIntervalSeconds= 5
eureka.client.serviceUrl.defaultZone=  xxxx


An API gateway service is a great addition to a microservices architecture and has definitely proved to be a boon for us. We have still not leveraged it to its maximum capacity and aim to use it for cross-cutting concerns like logging and caching in the coming months. The end goal is to have every microservice on-boarded onto this API gateway, enabling seamless communication both client-to-server and server-to-server.

Real time Analytics Pipeline Using AWS

At HealthKart, we use the lambda architecture to build our real-time analytics pipeline. The most critical part of this setup is picking frameworks that are extensible and do not take a heavy toll on your infrastructure cost.

Keeping these things in mind, AWS was the most viable option for implementing a lambda architecture to achieve real-time analytics for the HealthKart platform. Below is the architectural diagram of our setup, which comprises multiple frameworks, each explained below.

Lambda Architecture for real time analytics
  • AWS Pinpoint – AWS Pinpoint is primarily a mobile analytics framework that also has a JS SDK along with REST APIs. It provides APIs to fire pre-built and custom events from the client side, which get stored in S3 buckets in JSON format. Since it has client SDKs, it provides lots of pre-built client metrics – session time, DAU/MAU, geographical information – in the Pinpoint dashboard. On top of that, 100M events are free, and additional events cost $1 per million. This makes it cost-optimal if your events number a few hundred million per month.
  • S3 bucket – All event data fired from the client side gets stored in an S3 bucket, which is scalable and easy to integrate with other AWS services.
  • Kinesis stream – Amazon Kinesis makes it easy to collect, process and analyze real-time streaming data so you can get timely insights and react quickly to new information. It offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit your application's requirements. We use Kinesis to push all event data received from our app in real time.
  • Application group listeners – These are Kinesis clients that listen to the Kinesis stream and power parallel processing of streaming data in real time. Multiple application groups can run in parallel to process large amounts of data. We process this streaming data to determine which products are trending in real time, maintain users' recently viewed history, create personalized listing results, send real-time push notifications based on event-data rules, etc.
  • Redis cluster – The application group listeners prepare the required data – trending, viewing history, personalized data, etc. – and put it in a Redis cluster. Our platform uses this data to serve users on app/web in real time. Since Redis supports multiple data structures beyond plain key-value pairs, it becomes easy to serve different kinds of pre-built data in real time based on need.
  • Redshift – AWS Redshift powers analytics workloads at petabyte scale. We further pass S3 event data to Redshift so that on-demand and ad hoc analytical queries can be processed faster for in-house reporting.
  • Qlik Sense – Qlik Sense is the BI reporting tool integrated with the Redshift columnar database to power our business reporting.
  • Athena – Athena can fire SQL queries directly on the JSON data stored in S3 for analytics and reporting purposes.
  • QuickSight – Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in the organization. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards.
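The listener-to-Redis hop for trending products boils down to incrementing a per-product counter on every view event and reading back the top N; Redis sorted sets (ZINCRBY plus ZREVRANGE) serve exactly this shape of query. A stdlib-only Java sketch of the same idea, with made-up product IDs:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Sketch of the trending computation an application-group listener performs:
// each "product viewed" event bumps a counter, and the storefront reads top-N.
public class TrendingSketch {
    private final Map<String, Long> views = new ConcurrentHashMap<>();

    void onEvent(String productId) {
        views.merge(productId, 1L, Long::sum); // ~ ZINCRBY trending 1 productId
    }

    List<String> topN(int n) {                 // ~ ZREVRANGE trending 0 n-1
        return views.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

In the real pipeline the counters live in the Redis cluster rather than in process memory, so every listener instance and every web/app server sees the same trending list.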

We also use the same setup to power real-time user engagement, since the architecture is extensible and follows the open/closed principle. Our user-journey workflow system listens to the same stream to send personalized push notifications to users in real time based on their actions. We use the Flowable workflow engine, integrated with the Kinesis application groups, for this purpose.

The above is based on our experience and work here at HealthKart. Please feel free to comment with your thoughts.