Sunday, May 10, 2020

TinyDocumentDB - part 3 - querying for data (the plan)

Alright, so this is completely new ground for me. I've never written any kind of dynamic query function for real before, at least not where I also had to design the indexing functionality. I'm writing this part of the text even before I've started to write any code at all, since I need to get a grip on the requirements here.

Later on, I do need to add some kind of schema support to divide different types of entities from each other. At the moment, I'm just going to ignore that fact and treat the data as a single entity. From a tech perspective, it's the same thing if it's one or multiple entities in the database.

The plan for upserting data

Let's start out with a simple definition of what we should accomplish:
  • As data flows into the document database (as an INSERT or as an UPDATE, known as an UPSERT), we need to pick up the values of certain fields and store them in some kind of index with a reference back to the original data. (Hmm, what about a feature to be able to search historical data?)
  • When we search for data, we will query this index and return results based on that.
  • It will not be optimized in any way, just make it work.
Straightforward enough.
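To make the idea concrete, here is a minimal sketch of what such an index could look like: an in-memory inverted index mapping field values back to document ids. The field names, the `index_document` helper, and the overall shape are my own assumptions for illustration, not a fixed design.

```python
from collections import defaultdict

# A hypothetical inverted index: one entry per indexed field, mapping
# each observed value to the set of document ids that contain it.
index = defaultdict(lambda: defaultdict(set))

def index_document(doc_id, document, indexed_fields):
    """Pick up the values of the indexed fields from a document and
    store them with a reference back to the original data."""
    for field in indexed_fields:
        if field in document:
            index[field][document[field]].add(doc_id)

index_document("doc-1", {"name": "alice", "city": "malmo"}, ["name", "city"])
index_document("doc-2", {"name": "bob", "city": "malmo"}, ["name", "city"])

print(index["city"]["malmo"])  # both documents share this city value
```

Searching then becomes a lookup in this structure rather than a scan over every stored document.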

The diagram

A plan is nothing without a diagram.

The step-by-step plan for storing data (post diagram)

After the diagram, we have enough to create a step-by-step plan:
  1. On the left-hand side, we have data; we pass that data to the core-client.
  2. Store the data as fast as possible in storage.
  3. The core-client reads any index definition associated with the type of data (we only have one type... data...) and extracts the data from the fields that we define in the index definition.
  4. Read the index-file from storage, update it, and write it to storage.
Of course, from a performance point of view, this will behave terribly under pressure. But we'll get to that later on.
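The four steps above can be sketched roughly like this. Everything here is an assumption on my part: the file layout (one JSON file per document plus a single `index.json`), the `upsert` name, and passing the index definition as a plain list of field names.

```python
import json
import os

def upsert(storage_dir, doc_id, document, index_definition):
    """Sketch of the storage plan: store the document first, then
    extract the indexed fields and rewrite the index file."""
    # Steps 1-2: store the data as fast as possible in storage.
    with open(os.path.join(storage_dir, f"{doc_id}.json"), "w") as f:
        json.dump(document, f)

    # Step 3: extract the values of the fields named in the
    # index definition (here just a list of field names).
    extracted = {field: document[field]
                 for field in index_definition if field in document}

    # Step 4: read the index file, update it, and write it back.
    index_path = os.path.join(storage_dir, "index.json")
    index = {}
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
    for field, value in extracted.items():
        doc_ids = index.setdefault(field, {}).setdefault(str(value), [])
        if doc_id not in doc_ids:
            doc_ids.append(doc_id)
    with open(index_path, "w") as f:
        json.dump(index, f)
```

Rewriting the whole index file on every upsert is exactly the "performs terribly under pressure" part, but it keeps the first implementation simple.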

The plan for querying data

I admit I haven't given the query language details any thought at all, and it will not look like this later on.

But I do think that it's important to have a GET query mechanism available as well as a POST query mechanism. The GET variant will be easier to use ad hoc, and the POST variant can offer more fine-grained control.
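A rough sketch of the difference between the two variants, without committing to any real query language yet. The query-string shape, the `parse_get_query`/`parse_post_query` names, and the `$in` operator in the POST body are all hypothetical examples, not a decided syntax.

```python
import json
from urllib.parse import parse_qs

def parse_get_query(query_string):
    """GET variant: filters encoded in the query string,
    easy to type ad hoc in a browser or curl."""
    params = parse_qs(query_string)
    # Take the first value for each field; one filter per field.
    return {field: values[0] for field, values in params.items()}

def parse_post_query(body):
    """POST variant: a JSON body, which leaves room for richer
    operators and nested conditions later on."""
    return json.loads(body)

print(parse_get_query("name=alice&city=malmo"))
print(parse_post_query('{"name": "alice", "city": {"$in": ["malmo"]}}'))
```

Both end up as the same kind of filter structure internally; the POST body just has room to grow.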

The step-by-step plan for querying data (post diagram)

At the top level, we would do something like this:

  1. Get the query to the Core.Client.
  2. Read the associated index file based on the entity type (we only have one).
  3. Parse the index file for search results and store any unique document keys (perhaps the index file could also contain more information, such as where in the document the value is found?).
  4. Read all the data from storage and return it.
Straightforward enough. Now I have a rough plan to get started with the implementation.
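The query steps above can be sketched as the mirror image of the upsert sketch. Same disclaimers apply: the `index.json` file name, the one-file-per-document layout, and the ANDing of multiple filters are assumptions I'm making for illustration.

```python
import json
import os

def query(storage_dir, filters):
    """Sketch of the query plan: read the index, collect the unique
    document keys that match, then load each document from storage."""
    # Step 2: read the associated index file (one entity type,
    # so just one index file).
    with open(os.path.join(storage_dir, "index.json")) as f:
        index = json.load(f)

    # Step 3: collect the unique document keys that match every
    # filter (multiple filters are ANDed together here).
    doc_ids = None
    for field, value in filters.items():
        matches = set(index.get(field, {}).get(str(value), []))
        doc_ids = matches if doc_ids is None else doc_ids & matches

    # Step 4: read all the matching data from storage and return it.
    results = []
    for doc_id in sorted(doc_ids or []):
        with open(os.path.join(storage_dir, f"{doc_id}.json")) as f:
            results.append(json.load(f))
    return results
```

Nothing is optimized: the whole index is read on every query, exactly as the plan says.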


Achieved so far (blog post part number within [x]):

  • [2] A REST API for reading and writing
  • [2] InMemory or File-system reader writer
  • [2] CI/CD pipeline using Github Actions
  • [3] A plan for indexing data and for querying it

New ideas/questions:

  • [2] How to handle schemas without making it complicated?
  • [2] Schema stitching?
  • [3] Storing more data in the index?

Next up:

  • Querying of data - simple indexing implementation
