What would be the best way to create / save / load a large dictionary?

Imagine you have a large number of key/value pairs, like this:

240 nm: 0.001
241 nm: 0.003
242 nm: 0.012
…
400 nm: 0.002

Note that there is more than one such set of key/value pairs (e.g. for ssDNA, dsDNA, ssRNA, dsRNA, etc.) - it’s basically scientific data used to calculate certain properties and plot them on screen.

Currently I create the dictionaries on start-up and set the values in an init method (and I need to use 3 init methods, as I exceed the character limit a single method can have). I’m wondering if this is the best way to go about it.

Any ideas?

Why don’t you put the key/value pairs in text files and read those at start-up? You’d only need a short helper method to read those files…
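
Something like this, as a minimal sketch (assuming one tab-separated "wavelength<TAB>absorbance" pair per line; the file format and method name are made up):

  Function LoadSpectrum(f As FolderItem) As Dictionary
    ' Reads one "wavelength<TAB>absorbance" pair per line into a Dictionary.
    Var d As New Dictionary
    Var input As TextInputStream = TextInputStream.Open(f)
    While Not input.EndOfFile
      Var parts() As String = input.ReadLine.Split(Chr(9)) ' tab-separated
      If parts.Count = 2 Then
        d.Value(parts(0).ToDouble) = parts(1).ToDouble ' key: nm, value: absorbance
      End If
    Wend
    input.Close
    Return d
  End Function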

an SQLite database by far… especially if there could be duplicate keys

uh pardon?
these have to be enormous then?
any of the suggestions should work:
- constants
- text files
- SQLite database

Because then I need to turn those text files into dictionaries, which is easy enough, but the overall time it takes is noticeably longer. That’s why I put them straight into a dictionary.

There aren’t duplicate keys within a dictionary, but the same keys recur across dictionaries (the keys are wavelengths of light; the values are absorption at that wavelength and are specific to each substance).

Correct me if I’m wrong, but wouldn’t database access be very much slower than creating and accessing a dictionary?

an in memory db might surprise you
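
A rough sketch of what that looks like in Xojo (connecting an SQLiteDatabase with no DatabaseFile set gives you an in-memory database; the table and column names here are made up):

  Var db As New SQLiteDatabase
  db.Connect ' no DatabaseFile set, so this is an in-memory database

  db.ExecuteSQL("CREATE TABLE spectra (substance TEXT, wavelength INTEGER, absorbance DOUBLE)")
  db.ExecuteSQL("INSERT INTO spectra VALUES (?, ?, ?)", "dsDNA", 240, 0.001)

  Var rs As RowSet = db.SelectSQL("SELECT absorbance FROM spectra WHERE substance = ? AND wavelength = ?", "dsDNA", 240)
  If Not rs.AfterLastRow Then
    Var a As Double = rs.Column("absorbance").DoubleValue
  End If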

You know those tomes of physical data, like endless tables of absorption spectra for different substances? That’s what I’m dealing with …

I guess what I’m asking is “is there a way to save and load a dictionary that would be faster than creating it from scratch each time?” (and yes, they become properties of a module so I only have to do it on startup).

Faster is “don’t do it at all”.
Other than that there aren’t a lot of options - they will all take some time.

I’d create a db - once - then just use the heck out of it wherever you need it, so you literally never need to load things again.
And that DB could be quickly read into memory to improve speed - of course, if you set the cache sizes large enough it may all be in memory anyway.

I think you will find a DB to be way faster… and the larger the amount of data, the relatively faster it will be… and as Norman said, create it once. In-memory would be fastest (you can use the SQLite backup command to load/save it to disk), but depending on your access patterns, you might find direct disk access is fast enough.
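
In case it helps, here’s a rough sketch of that pattern (build the db once on disk, then pull it into an in-memory db at start-up; I’ve used SQLite’s ATTACH rather than a specific backup API, and the file and table names are made up):

  Var mem As New SQLiteDatabase
  mem.Connect ' no DatabaseFile set = in-memory

  ' Copy the prebuilt disk table into memory with plain SQL.
  Var f As FolderItem = SpecialFolder.ApplicationData.Child("spectra.sqlite") ' hypothetical file
  mem.ExecuteSQL("ATTACH DATABASE '" + f.NativePath + "' AS disk") ' a path containing ' would need escaping
  mem.ExecuteSQL("CREATE TABLE spectra AS SELECT * FROM disk.spectra")
  mem.ExecuteSQL("DETACH DATABASE disk")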

I second Dave and Norman, an SQL database, especially an in memory SQL database.

My other suggestion would be to switch over to an NSDictionary and use the built-in OS functionality to load and save binary plists.

Edit: plist keys must be strings, so this may or may not cause a problem.

Have you tried BinaryStream? You can avoid converting the values to/from text by writing them directly. I don’t know how long it would take in your case, though.
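
For what it’s worth, a minimal sketch of that idea (method names are made up; the doubles are written back-to-back with no text conversion):

  Sub SaveSpectrum(d As Dictionary, f As FolderItem)
    Var bs As BinaryStream = BinaryStream.Create(f, True)
    For Each entry As DictionaryEntry In d
      bs.WriteDouble(entry.Key.DoubleValue)   ' wavelength
      bs.WriteDouble(entry.Value.DoubleValue) ' absorbance
    Next
    bs.Close
  End Sub

  Function LoadSpectrum(f As FolderItem) As Dictionary
    Var d As New Dictionary
    Var bs As BinaryStream = BinaryStream.Open(f)
    While Not bs.EndOfFile
      Var wavelength As Double = bs.ReadDouble
      d.Value(wavelength) = bs.ReadDouble
    Wend
    bs.Close
    Return d
  End Function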

Not sure if you can load an SQL Dump file into an in memory db. That could be fast to load.

Is there a way to save a Dictionary as a MemoryBlock and back? Trying to think outside the box…

Another idea would be to store it as JSON, which acts like a dictionary, with many ways to load the JSON via Xojo, MBS, etc.
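
A quick sketch of the JSON route (note JSON object keys must be strings, so the wavelengths come back as "240", "241", …):

  Var json As String = GenerateJSON(d) ' d is a Dictionary with String keys
  ' ... save json to a file, then at start-up:
  Var loaded As Dictionary = ParseJSON(json)
  Var a As Double = loaded.Value("240")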

This is a good question as I need to implement typeahead with a ton of values… 🙂

That data looks like a UV/Vis spectrum.

-Karen

👍 🍪

Given you have an ordered list of wavelength/absorbance pairs with a known wavelength resolution (the delta between wavelengths), why would you even use a dictionary? Access for any wavelength or wavelength range would be faster using an array, as you can calculate the offsets you need.
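
Something like this (a sketch; the start wavelength, step, and names are assumptions):

  Function AbsorbanceAt(wavelength As Integer) As Double
    ' Absorbance() is a module-level Double array filled once at start-up,
    ' where index 0 holds the value for the first wavelength in the table.
    Const startNM = 240 ' first wavelength (assumption)
    Const stepNM = 1    ' wavelength resolution (assumption)
    Return Absorbance((wavelength - startNM) \ stepNM) ' plain arithmetic, no hashing
  End Function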

If speed is the issue, particularly for saving and loading, I would consider using a MemoryBlock subclass, with the subclass methods doing the point access.

That way the contents can be saved/loaded as strings to/from a file or DB with minimal overhead.
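
Roughly like this (a sketch of the idea; the class would be set up in the IDE, and the names are made up):

  ' Method on a hypothetical MemoryBlock subclass holding the absorbances
  ' as packed doubles, one per wavelength step:
  Function AbsorbanceAt(wavelength As Integer) As Double
    Const startNM = 240 ' first wavelength (assumption)
    Return Me.DoubleValue((wavelength - startNM) * 8) ' 8 bytes per Double
  End Function

  ' Saving/loading is then just the raw bytes, e.g.:
  '   bs.Write(Me.StringValue(0, Me.Size))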

-Karen

Slower? Technically yes. But how much slower is the question. I use SQLite dbs for “lookup” tables with million+ rows, used in dozens of calculations in one method (each calculation does a separate lookup), and it all runs in less than a second.

So, the real question is how often/how many times are you looking up a value? Will a few more nanoseconds per lookup really cause an issue?

Don’t forget that with a db, accessing many rows that are “related” can be a one-time query, whereas with a dictionary it’s a LOT of queries to the dictionary, one key at a time.

If you needed all the values between 240 nm and 300 nm you can get that in one query
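
i.e. something along these lines (table and column names made up):

  Var rs As RowSet = db.SelectSQL( _
    "SELECT wavelength, absorbance FROM spectra " _
    + "WHERE substance = ? AND wavelength BETWEEN ? AND ? ORDER BY wavelength", _
    "dsDNA", 240, 300)
  While Not rs.AfterLastRow
    ' one row per wavelength in the range
    rs.MoveToNextRow
  Wend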

IF the db is in memory then it’s damned fast.
And depending on how Xojo implemented their dictionary, it could even be faster.
Would really have to test that to be sure though.

It depends on your specific use cases, of course, but my general feeling is I doubt it. There are probably very few applications involving “retrieve some data and plot it” where SQLite isn’t more than fast enough. There are innumerable applications doing that sort of thing with SQLite and I doubt many of them have ever found SQLite to be a bottleneck. Xojo dictionaries aren’t especially fast, so it wouldn’t surprise me at all if you found SQLite to be faster.

I can see where a database would be faster in retrieving lots of values (rows), but that is not the situation.

I would, for example, have a DNA sequence (a string that can be quite long) and need the absorption value for each “double” (2 chars) in the sequence, e.g.

DNA: GATCGAT
Values needed for: -G GA AT TC CG GA AT T-

with calculations then performed with those retrieved values.

Now imagine having a few hundred thousand or even millions of DNA pieces (ranging in size from just a few letters to, much less commonly, a few hundred). And yes, I’m speeding the calculation up by counting how many of each double there are in longer sequences (though that slows the calculation down if I do it for all fragments, as most are short).
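
For what it’s worth, a minimal sketch of that counting approach (names are made up, and the “-G”/“T-” end pairs from the example above are omitted for brevity):

  Function SequenceValue(seq As String, absorbance As Dictionary) As Double
    ' Count each distinct 2-char "double" once...
    Var counts As New Dictionary
    For i As Integer = 0 To seq.Length - 2
      Var pair As String = seq.Middle(i, 2)
      counts.Value(pair) = counts.Lookup(pair, 0).IntegerValue + 1
    Next
    ' ...then one dictionary lookup per distinct pair, not per position.
    Var total As Double
    For Each entry As DictionaryEntry In counts
      total = total + entry.Value.IntegerValue * absorbance.Value(entry.Key).DoubleValue
    Next
    Return total
  End Function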

Now do the equivalent with another dictionary for a hundred thousand proteins (which use an alphabet of 20 chars) and their fragments (a few million).

I remember once being told that a dictionary is the fastest way of looking up key/value pairs, so that’s why I’m using it.

I still have to test, as I was busy elsewhere, but I don’t think a database could be quicker - though it might be quicker to load the dictionary from a database rather than create it on each app start.