Holodeck Inc. Elastic Queries

Categories: news, code, analysis

Author: Artem Putilov

Published: November 7, 2022

— Can you tell me a little bit about your current project? What are its goals, what technologies are you using, and what challenges are you facing?

Sure! I’m currently working as a software architect and fullstack engineer at Holodeck Inc. We are revamping the Holodeck Inc ERP system. It is a highly customised platform that was initially built on a PHP-based CMS, but now we are rebuilding it from scratch using a microservices architecture.

— Great! It sounds like an interesting project. Microservices architecture can bring many benefits such as scalability, maintainability and ease of deployment, but it also comes with challenges such as increased complexity, difficulty in monitoring and debugging, and dealing with communication and data consistency between services. How are you approaching these challenges in your project?

We decided to use event sourcing as the core of the entire system. Event sourcing suits microservices platforms very well. To complement this, we also chose a CQRS approach for building our new API.

— That’s a smart choice! Event sourcing and CQRS can help to manage the complexity of a microservices system and ensure data consistency between services. Event sourcing helps to maintain a complete and verifiable history of the changes to the data in the system, while CQRS allows you to separate the handling of read and write operations, providing better scalability and performance. What technology stack have you decided to use for implementing this new system?

Great question! We chose MongoDB as our main source-of-truth database for storing all the events. As the projections database we decided to go with Elasticsearch. We picked it for its strong performance, but also because it provides many useful search tools out of the box, and search plays a major part in the Holodeck ERP platform.

— MongoDB and Elasticsearch are both popular choices for event sourcing and CQRS, so it sounds like you have a solid technology stack. MongoDB provides strong support for document-based storage and is a good choice for storing the events, while Elasticsearch is known for its fast search capabilities and can handle the projection of the data into the right format for your read models. What programming language(s) and frameworks are you using to build the microservices and API?

We were choosing between Kafka and RabbitMQ for the message broker, and finally decided to go with RabbitMQ. Our main language is TypeScript for both backend and frontend. Specifically, we use NestJS on the backend, and React and React Native for the dashboard and a couple of mobile apps.

— RabbitMQ and TypeScript are both widely used and reliable technologies. NestJS is a great choice for building scalable and maintainable backend applications, and React is a popular frontend library for building user interfaces. Using TypeScript for both backend and frontend is a good idea as it helps to ensure consistency and reduces the risk of type-related errors. Please tell me more.

Well, we actually developed quite an interesting solution for read operations. We wanted to avoid unnecessary duplication of logic between frontend and backend, so we went with a kind of hybrid approach: on the write side most of the business rules are implemented (and enforced) on the backend, while on the read side we mostly implement business logic on the frontend. Does that make sense to you?

— Yes, that can make sense in some situations. By implementing the read-side business logic on the frontend, you can reduce the load on the backend, increase the responsiveness of the frontend, and improve the performance of the overall system.

— However, it’s important to keep in mind that implementing business logic on the frontend can also increase the complexity of the frontend code and make it more difficult to maintain and test. Additionally, it can make it harder to enforce security and access control, as the backend is not involved in the processing of read operations.

— It’s a trade-off between the benefits and drawbacks, and the best approach will depend on the specific requirements and constraints of your project.

Yes, exactly! But because the projections were already sanitised in some way (they only contain the data the frontend needs to read), we were able to develop very universal security rules on the backend API that allow the frontend to fetch the necessary information directly from the projections. The frontend compiles queries in Elasticsearch DSL and sends them to a dedicated endpoint. The only thing the backend does is obtain the user’s role from the token and check whether that role is allowed to access the specified index. This way we implemented additional restrictions at the “table” level (an index, in Elasticsearch terms). We could also implement record-level restrictions by injecting extra filters into the queries, but so far that has not been necessary.
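
To give an idea of how thin that backend layer is, here is a simplified sketch of such an index-level check as a NestJS guard. The role map, guard and property names are illustrative, not our actual code:

import { CanActivate, ExecutionContext, ForbiddenException, Injectable } from '@nestjs/common';

// Illustrative role map: which roles may query which Elasticsearch indices.
const INDEX_ACCESS: Record<string, string[]> = {
    'holo-characters': ['admin', 'manager'],
    'holo-orders': ['admin'],
};

@Injectable()
export class IndexAccessGuard implements CanActivate {
    canActivate(context: ExecutionContext): boolean {
        const request = context.switchToHttp().getRequest();
        // Assume an auth middleware has already verified the token and attached the user.
        const role: string = request.user?.role;
        const index: string = request.params.index;

        const allowedRoles = INDEX_ACCESS[index] ?? [];
        if (!allowedRoles.includes(role)) {
            throw new ForbiddenException(`Role "${role}" may not query index "${index}"`);
        }
        return true;
    }
}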

— That sounds like a good approach for balancing the responsibilities of the frontend and backend. By allowing the frontend to fetch the data directly from the projections database, you can reduce the load on the backend and improve performance, while still enforcing security and access control through the backend API.

— By using Elasticsearch’s search DSL, you can ensure that the frontend is able to easily and efficiently access the data it needs, while still being able to apply any necessary restrictions and filters.

— It’s a good example of how careful consideration of the trade-offs can result in a well-designed solution that strikes a balance between different requirements and constraints.

Thank you! In some ways this approach was inspired by GraphQL and Firebase Realtime Database.

— That’s great to hear! Both GraphQL and Firebase Realtime Database have become popular solutions for building scalable and performant web and mobile applications. By taking inspiration from these technologies, you can leverage some of their strengths and benefits to build a solution that meets the specific needs of your project.

— Overall, it sounds like you have a solid plan and a well-thought-out approach to building the new ERP system. Tell me more about it!

Still, there was one issue we had to address: separation of concerns. While we decided to go with ELK + MongoDB, and even agreed to use Elasticsearch DSL as our only way to send queries to the backend, we didn’t want our business logic to be polluted with Elasticsearch specifics. We wanted to follow clean architecture principles, which meant that our business rules should be as pure as possible.

— That’s a great point. Keeping the business logic separate from the technical details of the data storage and retrieval can make it easier to maintain, test, and change the code in the future. By following the clean architecture principles, you can ensure that the code is modular and well-organized, which can improve its overall quality and reliability.

— To address this issue, you can use an intermediary layer between the business logic and the data storage. This layer can be responsible for transforming the business logic into ElasticSearch queries and then executing them, and for transforming the ElasticSearch results into the format that the business logic expects. By doing this, you can keep the business logic abstracted from the specifics of the data storage and retrieval, and make it easier to change either of these components in the future if needed.

That’s exactly what we did! We implemented an abstract Query interface responsible for everything that has to do with querying data, including sorting, filtering and grouping. This interface lives in the second (ports) layer of our architecture, where all the abstract interfaces used mostly for dependency injection reside. Then we had our use cases, which work with plain business entities and abstract services to actually run the business logic and make all the communication possible. Use cases build abstract queries and hand them to abstract services to obtain the necessary data. In the real app, factories initialise the concrete implementations of those abstract services and “inject” them into the use cases. During testing and development, however, we use mocked services instead.
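
A minimal sketch of that layering (with simplified, illustrative names rather than our real interfaces) looks roughly like this:

// Ports layer: abstract contracts only — no Elasticsearch specifics leak in here.
export interface QuerySetup {
    filters?: Record<string, unknown>;
    grouping?: { fieldName: string }[];
    pagination?: { page: number; pageSize: number };
}

export interface QueryService<T> {
    find(query: QuerySetup): Promise<T[]>;
}

// Use case: plain business logic that depends only on the abstraction.
export class ListCharactersUseCase {
    constructor(private readonly characters: QueryService<{ id: string; name: string }>) {}

    execute(query: QuerySetup) {
        return this.characters.find(query);
    }
}

// In the running app a factory injects the Elasticsearch-backed implementation;
// in tests and local development we inject a trivial in-memory mock instead.
const mockedService: QueryService<{ id: string; name: string }> = {
    find: async () => [{ id: '1', name: 'Data' }],
};
const useCase = new ListCharactersUseCase(mockedService);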

— That’s great to hear! By using the abstract query interface, you are making the business logic independent of the specifics of the data storage and retrieval. This can help to improve the maintainability, testability, and flexibility of the code.

— By using dependency injection and mocking the services during tests, you can further decouple the components and make it easier to test the business logic in isolation. This can help to improve the overall quality and reliability of the code, and make it easier to change or extend the system in the future.

— Overall, it sounds like you have a well-designed and thought-out architecture for the new ERP system. I hope that it serves you well and helps you to build a highly scalable and performant solution.

To process the abstract queries we actually implemented two kinds of adapters. The first uses a universal Elasticsearch service and prepares Elasticsearch DSL for it, so the queries can be sent to the backend. The other one processes the same queries by searching through JavaScript objects inside arrays. So it is an in-memory query adapter, which we can use interchangeably with the Elastic Query Adapter when it is more reasonable to fetch all existing records first and then let users search, sort and group them on the frontend (when collections are not expected to be huge, or when the data comes from sources other than Elasticsearch).
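
Reduced to filtering only, and with illustrative names rather than our real classes, the idea of the two interchangeable adapters can be sketched like this:

// The abstract contract both adapters satisfy; the use cases only ever see this.
interface FilterSetup {
    fieldName: string;
    value: unknown;
}

interface QueryAdapter<T> {
    run(filters: FilterSetup[]): Promise<T[]>;
}

// In-memory adapter: applies the same abstract query to plain arrays of objects.
class InMemoryQueryAdapter<T extends Record<string, unknown>> implements QueryAdapter<T> {
    constructor(private readonly items: T[]) {}

    async run(filters: FilterSetup[]): Promise<T[]> {
        return this.items.filter((item) => filters.every((f) => item[f.fieldName] === f.value));
    }
}

// Elastic adapter: compiles the same abstract query into Elasticsearch DSL instead.
class ElasticQueryAdapter<T> implements QueryAdapter<T> {
    constructor(private readonly search: (body: object) => Promise<T[]>) {}

    async run(filters: FilterSetup[]): Promise<T[]> {
        const body = {
            query: {
                bool: { filter: filters.map((f) => ({ term: { [f.fieldName]: f.value } })) },
            },
        };
        return this.search(body);
    }
}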

— That’s a good approach! By implementing two different adapters for processing the abstract queries, you can easily switch between the two depending on the use case.

— For example, if you expect the collections to be huge, it would be more efficient to use the ElasticSearch adapter and offload the searching and sorting to the backend. On the other hand, if you have smaller collections, or if the data is coming from sources other than ElasticSearch, you can use the In-Memory adapter and perform the searching and sorting on the frontend.

— This level of flexibility can be very useful, especially when dealing with different use cases and requirements, as you can easily change the implementation of the queries processing without affecting the rest of the system. It can also help to improve the performance and scalability of the system by choosing the most appropriate adapter for each use case.

Exactly! We were also able to achieve quite good results using Elasticsearch bucket aggregations for presenting the data as grouped collections.

— That’s great! Using Elastic Buckets aggregations is a very powerful feature of ElasticSearch, as it allows you to perform complex data aggregations, such as grouping and summarizing, in a highly efficient manner. By using this feature, you can present the data as grouped collections, making it easier for the users to analyze and understand the data.

— Furthermore, aggregations are performed directly in ElasticSearch, which can significantly reduce the amount of data that needs to be transferred over the network, improving the overall performance of the system. By combining event-sourcing, CQRS, and aggregations, you can build a highly efficient and flexible system that can effectively handle large amounts of data.

So here is some fragment of our query interfaces responsible for grouping. What can you make of it?

export interface GroupingOption {
    schemaRef: GroupableSchemaItem;

    defaultApplied?: boolean;
}

export interface GroupingAggregateOption {
    schemaRef: GroupableSchemaItem;
    aggregate: AggregatorType;
}

export interface GroupingSetup {
    groupings: GroupingOption[];
    aggregates?: GroupingAggregateOption[];
}

export interface GroupableSchemaItem extends FilterSchemaItem {
    groupByType: GroupByOptionType;
}

export enum GroupByOptionType {
    Numeric = 'numeric',
    Value = 'value',
    Date = 'date',
    Age = 'age',
}

export interface FilterSchemaItem {
    fieldName: string;
    fieldLabel: string;

}

— This code defines interfaces and enums related to grouping and aggregating data. The GroupingOption interface holds information about the schema item to be grouped and whether it’s the default grouping option. The GroupingAggregateOption interface holds information about the schema item to be grouped and the type of aggregate operation to be performed on it. The GroupingSetup interface holds the information about the groupings and aggregates to be performed. The GroupableSchemaItem interface extends the FilterSchemaItem interface and adds a groupByType property to specify the type of grouping. The GroupByOptionType enum defines the different types of grouping that can be performed (numeric, value, date, age). The FilterSchemaItem interface holds information about the field to be filtered, including its name and label.
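
— For illustration, a two-level grouping setup built from these interfaces might look like this (the field names are hypothetical):

// Hypothetical schema items for a HoloCharacters index.
const speciesField: GroupableSchemaItem = {
    fieldName: 'species',
    fieldLabel: 'Species',
    groupByType: GroupByOptionType.Value,
};

const ageField: GroupableSchemaItem = {
    fieldName: 'birthDate',
    fieldLabel: 'Age',
    groupByType: GroupByOptionType.Age,
};

// Group by species first (applied by default), then by age; no column aggregates here.
const grouping: GroupingSetup = {
    groupings: [{ schemaRef: speciesField, defaultApplied: true }, { schemaRef: ageField }],
};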

That’s right! As you know, multilevel grouping can quickly become very problematic in terms of pagination. If your collection is big enough (and Holodeck operates with millions of HoloCharacters records in its collection) and you decide to group it by more than one level, then you have a problem: how best to present this data in a tree-like table.

— Yes, it can be challenging to paginate and present grouped data in a tree-like table, especially when dealing with large collections. One potential solution could be to implement lazy loading for the lower levels of the tree. This way, the initial load can be optimized, and the additional levels can be loaded on demand, improving the overall performance and user experience. Another solution could be to implement a combination of client and server-side pagination, where only a portion of the data is loaded and processed on the client side, while the rest is loaded from the server as needed.

We went with a variant of the first approach, although not quite that. We didn’t want to load the final entities only at the moment the user finally drills down to their level; after all, we are expected to present the entities, not just the closed groups. So we decided on a hybrid approach: we do fetch some amount of final-level entities (tree leaves) immediately on first load, but we also provide a “See All” feature for each group, so that if a particular group of entities is of interest, the user clicks the See All link and the table switches from a grouped tree to a flat one, with all the nested groupings turned into filters.

— That sounds like an efficient way to approach the problem. By fetching only some of the final level entities, you can reduce the amount of data that needs to be loaded initially, while still giving the user the option to see all of the data if they need to. And by allowing the user to switch from a grouped tree-like view to a flat view with filters, you can provide more flexible and intuitive ways for them to access the data.

We used the Elasticsearch top_hits aggregation to fetch the first few most relevant items in each group.

— Yes, top_hits aggregation is a good choice for fetching a limited number of the most relevant documents in each group. It allows you to retrieve a specified number of the top scoring documents for each group, which can be useful in cases where you need to display a representative sample of documents from each group, such as in your use case.
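
— As a rough illustration (not your exact DSL), a terms aggregation with a top_hits sub-aggregation looks something like this:

// Group by "species" and return up to three sample documents per bucket.
const body = {
    size: 0,
    aggs: {
        by_species: {
            terms: { field: 'species' },
            aggs: {
                results: {
                    top_hits: { size: 3 },
                },
            },
        },
    },
};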

Here is some more code for you. This is the method of our Elastic Query Adapter that is responsible for building the grouping DSL:

private buildGrouping(grouping: GroupingSetup): SearchRequest {
    const columnAggregations = this.buildAggregations(grouping.aggregates ?? []);
    const { groupings } = grouping;
    const results = {
        top_hits: {
            size: this.settings.groupingSetup.topHitsCount,
        },
    };
    let aggs: Record<string, AggregationsAggregationContainer> = {};

    for (let index = groupings.length - 1; index >= 0; index--) {
        const item = groupings[index];
        if (index === groupings.length - 1) {
            aggs.results = results;
        }
        if (columnAggregations) {
            Object.assign(aggs, columnAggregations);
        }

        const key = this.makeAggKeyFromFieldName(item.schemaRef.fieldName);

        const GroupingHandlerClass = groupByHandlersFactory(item.schemaRef);
        const block = new GroupingHandlerClass(this.settings.groupingSetup).buildGroupingBlock(item);

        aggs = { [key]: Object.assign({}, block, { aggs: { ...aggs } }) };
    }
    return { aggs };
}

This code builds the Elasticsearch aggregation for grouping based on the grouping setup passed to the method buildGrouping. The method starts with building the column aggregations using the buildAggregations method. Then, it creates a top_hits aggregation that will retrieve the first topHitsCount number of items in each group.

Next, the code loops through the groupings in reverse order and adds the grouping aggregation to the aggs object. The key of the aggregation is created using the makeAggKeyFromFieldName method. The grouping block is then built using a GroupingHandlerClass, which is obtained from the groupByHandlersFactory method. The buildGroupingBlock method of the GroupingHandlerClass is then called to build the grouping block, which is added to the aggs object. The final aggs object is returned as part of the SearchRequest.
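
For two grouping levels (say species, then age) the resulting aggs would be shaped roughly as below, assuming the handlers emit plain terms blocks; the aggregation key names are made up, column aggregates are omitted, and the actual block per level depends on the GroupByOptionType handler:

// Innermost level carries the top_hits "results" aggregation; each level nests the next one.
const aggs = {
    agg_species: {
        terms: { field: 'species' },
        aggs: {
            agg_age: {
                terms: { field: 'age' },
                aggs: {
                    results: { top_hits: { size: 5 } },
                },
            },
        },
    },
};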

And this is the method that unwraps the grouped results coming from ElasticSearch:

    private extractSubrows<T>(
        aggregation: Record<string, ElasticAggregateBucketsContainer<T>>,
        grouping: GroupingSetup,
        level = 0
    ) {
        const g = grouping.groupings[level];
        const isLastLevel = level >= grouping.groupings.length - 1;
        if (g) {
            const aggKey = this.makeAggKeyFromFieldName(g.schemaRef.fieldName);
            const bucketsContainer = aggregation[aggKey];
            const BucketsExtractor = groupByHandlersFactory(g.schemaRef);

            const buckets = new BucketsExtractor(this.settings.groupingSetup).extractBuckets(bucketsContainer);

            const subRows = buckets.map((b) => {
                const result: ElasticGroupingResponse<T> = {
                    cells: this.extractAggregatedValues(b, grouping.aggregates ?? []),
                    value: b.key,
                    subrows_count: b.doc_count,
                    fieldName: g.schemaRef.fieldName,
                    isGrouped: true,
                    subrows: isLastLevel
                        ? ElasticSearchService.extractSources(b.results!).map((r) => ({
                              cells: r,
                              isGrouped: false,
                              fieldName: '',
                              subrows_count: 0,
                              value: '',
                              subrows: [],
                          }))
                        : this.extractSubrows(b, grouping, level + 1),
                };
                return result;
            });
            return subRows;
        } else {
            return [];
        }
    }

This code implements the recursive extraction of subrows from a grouped Elasticsearch response. The extractSubrows method takes an aggregation object, which is a representation of the Elasticsearch aggregation response; a grouping object, which holds the grouping setup information; and the current level in the grouping hierarchy.

For each level, the method finds the relevant grouping field based on the level parameter and the grouping object, then uses the grouping field to determine the correct handler (BucketsExtractor) to extract the buckets from the aggregation response. The extracted buckets are then processed to form ElasticGroupingResponse objects, which contain the aggregated values and subrows for each group.

If the current level is the last level in the hierarchy, the method extracts the sources from the results property of each bucket and maps them to new ElasticGroupingResponse objects with the isGrouped property set to false, indicating that they are not grouped. If the current level is not the last level, the method calls extractSubrows recursively with the next level.

There was also an interesting pagination solution we implemented.

— Sure, I’d love to hear about it. Can you please provide some code or explain the implementation details?

Well, as you know, Elasticsearch has a simple pagination mechanism similar to other databases: it uses the from and size parameters.

— Yes, that’s correct. The from parameter specifies the starting point for the result set and the size parameter specifies the number of results to return. Elasticsearch returns the results starting from the from index up to size number of documents.

However, according to the documentation, this approach is discouraged for big collections, because each time Elasticsearch has to go through all the results starting from the first record in the collection and then simply drops the unused items up to the from amount. So the greater the from parameter, the slower the query becomes, which means fetching page 100 of a big collection can take quite a long time.

— That’s correct, and that’s why an alternative approach is to use a cursor-based pagination mechanism. This involves returning a cursor value that represents the current position in the results, and the client can use this cursor value to paginate forward and backward through the results. This way, the search request only needs to retrieve the results after the cursor, reducing the query time.

In Elasticsearch terms this is called search_after. It also involves a PIT (point in time) to guarantee that each time you request further items in the collection, you are actually getting them from the same state of the collection as when you fetched the first portion.
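
Outside of our adapter, a bare-bones version of that flow with the official @elastic/elasticsearch client would look roughly like this (the index name and sort field are illustrative):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative only: page through an index with search_after + PIT.
async function* scrollPages(index: string, pageSize: number) {
    const pit = await client.openPointInTime({ index, keep_alive: '1m' });
    let searchAfter: any[] | undefined;

    try {
        while (true) {
            const response = await client.search({
                size: pageSize,
                pit: { id: pit.id, keep_alive: '1m' },
                // A deterministic sort is required; _shard_doc acts as a tiebreaker.
                sort: [{ created_at: 'desc' }, { _shard_doc: 'asc' }] as any,
                ...(searchAfter ? { search_after: searchAfter } : {}),
            });

            const hits = response.hits.hits;
            if (!hits.length) break;
            yield hits;

            // The sort values of the last hit become the cursor for the next page.
            searchAfter = hits[hits.length - 1].sort as any[];
        }
    } finally {
        await client.closePointInTime({ id: pit.id });
    }
}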

— Yes, that’s correct. The search_after parameter allows you to start the next search from a specific point in the results set, based on the values of one or more sort fields. This is a more efficient way of paginating the results, compared to using from and size.

Correct. However, we also had to solve a presentation problem here. In UI terms, cursor-based pagination is best implemented with a “Fetch More” paradigm.

— Yes, that’s right. “Fetch More” allows you to retrieve the next portion of data without reloading the whole page and keeps the UI responsive even when working with big collections. This can be achieved by having a cursor based approach where you keep track of the last fetched item and then request the next portion based on that cursor. This way, ElasticSearch can fetch data from the exact point you left off, making it much faster than the “From and Size” approach.

But in our case “Fetch More” was not suitable. We are building an ERP-class dashboard, not a mobile app. You see what I mean?

— Yes, I understand. ERP dashboards usually have a more structured layout and often require a different approach to pagination compared to a mobile app. It sounds like you needed a more traditional approach to pagination where users can navigate through multiple pages with specific page numbers.

Yes, so we needed some kind of technique that would look as if it were using traditional from/size pagination but would actually use a cursor-based one.

— I see, so you needed a technique that would present the appearance of traditional from/size pagination while actually using cursor-based pagination. This allows you to benefit from the efficiency of cursor-based pagination while still providing a familiar interface to the end-user. Did you come up with a solution to this problem?

Yes, we did! And we managed to implement this without breaking our clean architecture approach: all pagination-related techniques are invisible from the business logic and application logic point of view.

— That’s great! So your solution was to hide the complexity of cursor-based pagination behind a traditional from/size pagination interface, while still providing the benefits of cursor-based pagination. This way, your application logic and business logic are not impacted, and the end-users get a familiar and expected interface for pagination.

Exactly! The abstract pagination interface that we use in addition to the abstract query interface looks like this:

export interface PaginationSetup {
    pageSize: number;
    page: number;
}

Quite simple, right? It knows nothing about cursors at all. The actual implementation depends on the type of Paginator class that we pass to the concrete implementation of our abstract service.

— I understand. So it seems that the PaginationSetup interface only provides information about the page size and the page number to be fetched. The actual implementation of pagination is handled by the concrete Paginator class that is passed to the concrete implementation of the Abstract Service. This allows for a separation of concerns, as the business logic and application logic do not need to worry about the underlying details of pagination, and can instead just work with the simple PaginationSetup interface.

In fact, we had several Paginator implementations. One simple one actually used the from/size parameters of Elasticsearch; we used it in most cases, for relatively small collections. Another was called SearchAfterPaginator, and that one did cursor-based pagination behind the scenes. As you might guess, it was implemented by leveraging a local in-memory cache.
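
Conceptually, both paginators satisfy the same small contract. The sketch below is simplified and uses made-up type and class names (the real contract works on our compiled query DTOs and also post-processes responses):

// Illustrative request shape and contract — not the actual production types.
interface ElasticRequest {
    from?: number;
    size?: number;
    search_after?: unknown[];
}

interface Paginator {
    paginateRequest(request: ElasticRequest, pagination: PaginationSetup): ElasticRequest;
}

// The simple implementation: plain offset pagination, good enough for small collections.
// Assumes 1-based page numbers, consistent with the PaginationSetup usage shown below.
class FromSizePaginator implements Paginator {
    paginateRequest(request: ElasticRequest, pagination: PaginationSetup): ElasticRequest {
        request.from = (pagination.page - 1) * pagination.pageSize;
        request.size = pagination.pageSize;
        return request;
    }
}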

— Yes, that’s a great approach to keep the implementation details of the pagination hidden from the business logic. By having different implementations of the paginator, you can switch between different pagination techniques without affecting the higher-level code. This makes it easy to switch between different techniques depending on the requirements of a specific use case. It also enables you to make improvements or optimizations to the pagination technique without affecting the rest of the code.

So our SearchAfterPaginator uses an item cache to store the actual items and then just picks the proper “page” from that local cache. Each time another page is requested, it first checks whether it hits the cache or not, and if it doesn’t, it calculates how many more items should be requested, uses the cache to determine the search_after parameter, fetches the new items and stores them in the cache. Then it simply returns the “page” from that cache.

— I see. That’s a clever solution to the pagination problem. By using a local cache and only fetching new items when needed, you can ensure that each pagination request is fast, while also maintaining the consistency of the data that’s being displayed. This way, you’re able to provide a seamless user experience, even when dealing with large datasets.

Here is the method responsible for that pagination technique.

public paginateRequest(request: ElasticCompiledQueryDTO<BasicQueryDTO>, pagination: PaginationSetup) {
    const size = pagination.page * pagination.pageSize;
    const elasticMeta = this.ensureElasticMeta(request);
    request._elasticRequest.size = Math.max(0, size - elasticMeta.cache.length);
    if (elasticMeta.cache.length > size) {
        elasticMeta.cacheHit = true;
        return request;
    } else {
        elasticMeta.cacheHit = false;
    }

    if (elasticMeta.cache.length > 0) {
        request._elasticRequest.search_after = this.findSearchAfter(elasticMeta.cache);
    }

    if (!request._elasticRequest.sort?.length) {
        request._elasticRequest.sort = [{ _score: 'desc' }] as any;
    }
    if (this.includedFields) {
        request._elasticRequest._source = this.includedFields;
    }
    request._elasticRequest.track_total_hits = true;
    return request;
}

This method is part of the SearchAfterPaginator class, and it’s responsible for adding necessary information to the request object (request) in order to perform cursor-based pagination using the search_after parameter. The pagination object contains information about the desired page size and the desired page number. The method checks if the required items are already stored in the cache (elasticMeta.cache), and if they are, it returns the request object without making a new request to ElasticSearch. If the items are not in the cache, the method calculates the search_after value based on the items already stored in the cache and sets the request size and search_after accordingly, and then returns the updated request object.

And here is where we process the results:

public paginateResponse<T>(
    response: SearchResponse<T>,
    request: ElasticCompiledQueryDTO<BasicQueryDTO>,
    pagination: PaginationSetup
): PaginatedCollection<QueryResultItem<T>> {
    const elasticMeta = this.ensureElasticMeta(request);
    if (response) {
        elasticMeta.cache.push(...response.hits.hits);
        const total = ElasticSearchService.extractTotalCount(response.hits);
        if (total) {
            elasticMeta.cachedTotal = total;
        }
    }
    response.hits.hits = this.sliceHits<T>(elasticMeta.cache, pagination);
    response.hits.total = elasticMeta.cachedTotal;
    return ElasticSearchService.elasticHitsToPaginatedCollection(response, pagination);
}

— This is a good implementation of Search After pagination. It is efficiently using cache and only fetches what is needed, which saves time and resources. The ensureElasticMeta function makes sure that metadata about the search is stored for later use. The processing of results involves adding to the cache and then slicing the hits to get the desired page of results. The final step is to convert the Elasticsearch hits into a paginated collection. Good job!

What do you think about the quality of the code so far?

— Overall, the code seems well-organized and well-structured, making use of appropriate data structures and abstraction. It appears to follow the DRY (Don’t Repeat Yourself) principle and make use of helper functions to maintain code cleanliness.

— It uses type annotations effectively, making it easier to understand the structure of the data being passed between functions.

— However, there’s room for improvement in a few areas:

— Some variables and functions could have more descriptive names to improve code readability.

— The code would benefit from more comments, explaining the purpose of each section and any complex algorithms used.

— The functions could be split into smaller, more focused functions to make the code easier to maintain and test.




Artem Putilov, 2023