Asynchronous DB queries

Anyone had any experience doing async DB queries?

Generically, what I need to do is the following:

  1. Execute a synchronous call to a stored proc that returns an int ID

  2. Execute one or more less time-critical and generally less resource-intensive calls to other SPs that take that ID as input. These can / should run asynchronously; they are basically “upsert” operations that can be fire-and-forget as far as the caller is concerned, although any errors should be logged.

Basically a name/address is created in step 1 and step 2 associates optional phone #s, email addresses and other components to that name/address combo.

I am using .NET with Sql Server but I’m more interested in the practical problems you’ve encountered in these kinds of applications than in the specific details of a solution. I will have to decide between async / await syntax and more of a Task.Run(delegate) or parallel approach (time slicing vs true independent threads).
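For what it’s worth, the contrast I keep weighing looks roughly like the following. This is only a sketch with hypothetical method names (CreateNameAddress, UpsertPhoneNumbersAsync, Log) and assumes the usual using System / System.Collections.Generic / System.Threading.Tasks imports; it just illustrates where the error logging has to live in each style:

// Hypothetical signatures, for illustration only.
int CreateNameAddress(string name, string address) => 0;                           // step 1: sync SP call, returns the ID
Task UpsertPhoneNumbersAsync(int id, IList<string> phones) => Task.CompletedTask;  // step 2: async SP call
void Log(Exception ex) { }                                                          // error logging stub

// Option A (async/await): the ancillary call is awaited, so any exception
// surfaces right here and can be logged before moving on to the next entity.
async Task ProcessAwaited(string name, string address, IList<string> phones)
{
    int id = CreateNameAddress(name, address);
    try { await UpsertPhoneNumbersAsync(id, phones); }
    catch (Exception ex) { Log(ex); }
}

// Option B (fire-and-forget via Task.Run): the caller moves on immediately;
// the exception has to be caught inside the task or it is silently lost.
void ProcessFireAndForget(string name, string address, IList<string> phones)
{
    int id = CreateNameAddress(name, address);
    Task.Run(async () =>
    {
        try { await UpsertPhoneNumbersAsync(id, phones); }
        catch (Exception ex) { Log(ex); }
    });
}

Option B keeps the caller moving, but it makes it entirely my job to observe and log every exception inside the task; nothing downstream will ever see it.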

In the past this has worked okay with everything synchronous, but I’m looking for opportunities to improve overall throughput, and these operations seem like a good candidate. We are talking about processing maybe 200 entities per second, so I worry about enough of a work backlog developing that some sort of wall is hit where the thread pool is full. That’s fine if it degrades gracefully to synchronous throughput until it can catch up. It is not so great if everything tanks and the whole parade stops.
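One way I could get that graceful degradation is to cap the number of in-flight background upserts, e.g. with a SemaphoreSlim (System.Threading). A minimal sketch, assuming a hypothetical UpsertAncillaryAsync wrapper around the secondary SP calls and the same Log helper as above:

// Upper bound on in-flight background upserts; the number is a guess to tune.
static readonly SemaphoreSlim _ancillaryGate = new SemaphoreSlim(16);

async Task QueueAncillaryUpserts(int id)
{
    // Backpressure: when all slots are taken this await holds the caller up,
    // so under sustained load the pipeline degrades toward synchronous
    // throughput instead of piling unbounded work onto the thread pool.
    await _ancillaryGate.WaitAsync();

    _ = Task.Run(async () =>
    {
        try { await UpsertAncillaryAsync(id); }   // hypothetical wrapper for the phone/email/etc. upserts
        catch (Exception ex) { Log(ex); }         // fire-and-forget, but errors still get logged
        finally { _ancillaryGate.Release(); }
    });
}

With the cap in place the caller only stalls when the backlog is genuinely full, which is the “degrade to synchronous throughput” behavior I’m after rather than a thread-pool pile-up.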

Historically I have tended to really follow the KISS principle with this kind of thing, to the point of having other command line or Windows service processes that I can feed work to via a queue table, vaguely similar to Xojo’s approach to “multi-threading”. But it is time for me to fully explore my options here.

Any thoughts / experiences anyone has to share as to a general architectural approach / rules of thumb / antipatterns would be much appreciated.

If I understand correctly…

Step 1 will retrieve a series of ID numbers, and you want step 2 to operate in parallel with step 1 as those ID numbers are retrieved?

Why not KISS it, and create a JOIN query so it does BOTH at the same time

SELECT a.ID, b.otherData
FROM id_table a
JOIN otherTable b ON a.ID = b.id

obviously your real query would be more complex

The first SP takes as input a list of 1 or more business names and a business location (address), locates existing name(s) and address or creates them if they don’t exist, and assigns an id to the result.

The other calls just add ancillary data to that (new or existing) entity, e.g. a list of phone #s will be added as needed if they aren’t already there. Same for a list of email addresses and other things (contact names / titles, user notes, etc)

No, I am not looking for parallelism with step 1, since the other steps need the ID returned by the first step. However, I do not want to wait for a theoretically open-ended list of ancillary data to be appended to the name/address before I go on to the next name/address. The context would be a busy processing day where maybe two or three hundred data sets of various sizes are processed in as close to real time as can be managed.

So once I establish the entity identifier I would not want to block waiting for these accessory operations. They can complete with whatever latency is required.

In practice, I expect the first operation to be relatively pricey, the others are all small-ish and of similar size so they can run in parallel with each other and there’s no required communication between them.

So what I am hoping for: let’s say the first operation takes 0.25 seconds, and I have 3 ancillary operations that would take 0.1 seconds each. Synchronously that is 0.55 seconds total, and I want to get it down close to 0.35 seconds (0.25 for step 1 plus 0.1 for the three ancillary calls running in parallel).

Just to make it easy for me to get my brain around it, I’m assuming that by the time the first operation completes for the 2nd time, the ancillary calls from round 1 will all be done and we’ve saved up to 0.2 seconds by not waiting on them, and the whole process repeats.

Obviously in the real world, at times there could be system loading such that these secondary calls could fall behind a bit.

I’m new to async work in a batch environment, so my natural impulse is to want to put all the subsidiary operations into a work queue DB table and have a different process work through them and log the results. That’s more overhead for the work table in exchange for more detailed control and better recovery ability. For example, with a work queue, let’s say one of the phone #s doesn’t get recorded. I could query the work table for incomplete work with the associated errors (let’s say it’s a field overflow), which I can then correct at leisure. With a straight-up “fire off a bunch of tasks” approach I’d still need to capture and log any exceptions, but I’d not have the overhead of inserting and updating a work queue table plus the general overhead of a separate service. There would just be log entries for the exceptions.
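For completeness, the enqueue side of the work-queue variant would be something like this. The WorkQueue table and its EntityId / WorkType / Payload / Status columns are made up for illustration; the ADO.NET calls are the standard System.Data.SqlClient ones:

// Enqueue side only: after the synchronous step-1 upsert, each ancillary
// operation becomes a row in a (hypothetical) WorkQueue table that a separate
// worker process polls, executes, and marks as done or failed.
void EnqueueAncillaryWork(SqlConnection conn, int entityId, string workType, string payloadJson)
{
    using (var cmd = new SqlCommand(
        "INSERT INTO WorkQueue (EntityId, WorkType, Payload, Status) " +
        "VALUES (@id, @type, @payload, 'Pending')", conn))
    {
        cmd.Parameters.AddWithValue("@id", entityId);
        cmd.Parameters.AddWithValue("@type", workType);
        cmd.Parameters.AddWithValue("@payload", payloadJson);
        cmd.ExecuteNonQuery();
    }
}

// The worker would query for Status = 'Pending', run the matching SP, and on
// failure set Status = 'Error' plus the error text, which is what makes the
// "fix the field overflow at leisure" style of recovery possible.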

The devilish thing is that if you go async you are adding complexity (possibly a lot of it) in exchange for what you hope is better net throughput under load. But it sort of amounts by definition to premature optimization, which is the main reason I’ve resisted it until now. The other reason is that by all accounts, it’s hard to contain async code to certain critical paths; it’s a little like null propagation, it tends to bleed out into the whole system, which now wants to be fully async everywhere as it tends to get rid of deadlock problems and other issues. That tends to make the system conceptually more complex, even after you get used to the paradigm. Never a good thing in my experience.

So this represents a confined spot where I can maybe add some degree of parallelism without it blowing up in my face :wink:

The dangerous part is that the data may change between the different database calls, and different databases handle that differently. It is always a risk. So if you need the data to stay unchanged in between, you should look for mechanisms that guarantee it, and at worst that will cost performance on the network.

If that is not needed (for example, when you are only reading records that no longer change, like bookkeeping records that are kept for documentation and may not be modified), there is no problem.

I often prefer to do that kind of work inside the database, but, like I said, it depends on the database and its capabilities. In every case there is a risk to performance, and of course it will blow up in your face if it costs too much. You have to decide whether there is another way if the performance cost is too high.

In most cases you can find a solution that works.

The probability of two processes updating the same entity is very low, but I do have to think about it.

If I actually do it, it will probably end up being a synchronous operation for step 1 and task parallelization for the subsequent subtasks (if there is more than 1 of them, which honestly will be so infrequent at first that I might well elect not to even bother).

I can usually talk myself out of this kind of thing. After all, the old system was synchronous all the way (apart from Sql Server parallelizing query plans and steps internally and of course doing things in parallel on different connections). We even got away with most queries using the (either loved or dreaded) NOLOCK hint since it’s mostly just a reporting system. We have to avoid dirty reads when adding or updating related records of course.

Since I had acceptable performance on the old system and the new system has a more efficient schema to begin with, I am likely going to treat parallelizing of queries as the premature optimization that it probably is.

Sometimes the best implementation is no implementation at all :wink:

If you build in locking, you should not have any problems with async work.

My decision on this is to not implement this until there’s an actual throughput bottleneck that is measurable / reproducible and which it could then be empirically demonstrated that the tradeoffs / extra complexity are worth the performance gains.

When I actually examined my available notes from the old system, the ancillary upserts that could be done in parallel consist of phone #s, email addresses, contact names/titles, and user notes. In our initial implementation we won’t do contacts or notes this time around, and email addresses tend to be thinly populated, so what we end up with in practice is just 1 or more phone #s about 50 to 60% of the time. I would have both phone #s and email addresses only about 1% of the time, so the “lift” from parallelization will not be substantial.

FWIW from what I can tell, the best implementation strategy in my case, were I to do this, would be something along the lines of (in pseudo C#):

// Step 1 must complete first because it produces the entity ID.
DoTheSynchronousUpsertOfNamesAddresses();

// Step 2: start the ancillary upserts only where there is work for them.
var taskList = new List<Task>();

if (em.PhoneNumbers.Count > 0) {
  taskList.Add(DoThePhoneNumberUpdates());
}

if (em.EmailAddresses.Count > 0) {
  taskList.Add(DoTheEmailAddressUpdates());
}

// Await the combined task (in an async method) so that any exceptions
// surface here and can be logged rather than being silently dropped.
await Task.WhenAll(taskList);

In reality I should determine whether there is more than one task and, if not, just run the single task sequentially anyway; that is yet more setup overhead given how rarely there will be more than one task.
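Spelled out, still just a sketch reusing the pseudo method names from above, that check might look like:

// Only bring in the task machinery when there is genuinely parallel work;
// otherwise run the single operation on its own and skip the extra setup.
bool hasPhones = em.PhoneNumbers.Count > 0;
bool hasEmails = em.EmailAddresses.Count > 0;

if (hasPhones && hasEmails)
{
    await Task.WhenAll(DoThePhoneNumberUpdates(), DoTheEmailAddressUpdates());
}
else if (hasPhones)
{
    await DoThePhoneNumberUpdates();     // single operation: just run it in line
}
else if (hasEmails)
{
    await DoTheEmailAddressUpdates();
}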

The KISS principle wins again :wink: