How much memory may a HPPL program use on a G2?
09-21-2023, 01:27 AM
Post: #81
RE: How much memory may a HPPL program use on a G2?
(09-18-2023 06:48 AM)komame Wrote:  (lots of good stuff)

That's why I mentioned that to apply your suggestion with binary encoding, two dictionaries would need to be maintained (even before the program is run).

If accents are to be kept in a single dictionary, stepping up to 6 bits per character would seem to work (ten 6-bit characters fit in 60 bits, so with 64-bit ints 10-character blocks could be processed at once). (Optimizations of this sort -- picking a dense encoding -- certainly limit applicability...)

If matches are not to distinguish accents, the encoding can be arranged so that bitwise operations may still be used. Here is a possible encoding of the sort I'm imagining:

Code:

E: 0
È: 2
É: 3
Ê: 4
Ë: 5

A: 8
À: 9
Â: 11

I: 12
Î: 13 
Ï: 14

O: 16
Ô: 17 
Ö: 18 

U: 20
Û: 21
Ü: 22

C: 24
Ç: 25

-: 26

B: 27
D: 29
F: 31

Letters with accents are arranged into aligned blocks so that the upper bits may be used to mask for matches. (To mask for 'A's while allowing for accents, a mask of #111100b would be used instead of #111111b; the E block spans codes 0..5, so it needs the wider mask #111000b.)
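As a small concrete check of the table, a single-character accent-insensitive test might look like this (a sketch; it assumes the Home integer word size is large enough for the values):

Code:

// Accent-insensitive test of one 6-bit code against 'A' (code 8):
// the A block spans 8..11, so masking off the low two bits
// collapses A, À, and Â onto the same value.
LOCAL c:=11;                     // Â in the table above
IF BITAND(c,#111100b)==8 THEN
  PRINT("matches A, accents ignored");
END;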

There is still a fair amount of latitude in the encoding. To minimize the number of translations during encoding and decoding, the table above implies that A,B,...,Z would first be shifted to 26,27,...,51. (Although A eventually ends up at 8, as it has accented variants.) For amusement, the table places the accented E characters so that a common shift [of 198] maps them to and from their Unicode code points, the same shift that works for the accented U characters (È is U+00C8 = 200, and 200 - 198 = 2; Û is U+00DB = 219, and 219 - 198 = 21).

(Hmmm... looking over the above now, though: I cooked up the table by looking at convAccents(), and I notice there isn't a "Ù" in there, yet "Ù" does appear in dictionary words... So, dear reader, please make the necessary adjustments to this post :-&.)

(09-18-2023 06:48 AM)komame Wrote:  
In the case of binary encoding, we would have to perform this conversion by iterating over individual words and converting letter by letter, additionally performing many bitwise operations,

To defend the honour of lists Big Grin, one could try doing as much work as possible with lists. Just as a single BITAND could mask all the words of an encoded dictionary (of short words, <= 10 characters) for matching purposes, BITAND, comparisons, etc. may be applied to lists of lists during encoding / decoding, which may make both faster.
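For instance (encoded, mask, and pattern are hypothetical names, and this assumes BITAND and == broadcast over lists the way PPL arithmetic does):

Code:

// encoded: a list of packed integers, one per dictionary word.
// One BITAND masks every word at once; comparing against the
// masked pattern then flags the accent-insensitive matches.
LOCAL masked:=BITAND(encoded,mask);
LOCAL hits:=(masked==BITAND(pattern,mask));  // list of 0/1 flags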

Adding some list / string slicing methods to PPL might well make this sort of coding more efficient.

The first step in going from strings to a 6-bit-per-character encoding might, for a dictionary of 9-character words, be to use MAKELIST(ASC(MID(...)),...) on 9-character chunks of the string dictionary (generating a list of lists, with possibly one more level of looping / construction due to the limit of 10,000 elements per list).
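A minimal sketch of that first step (dict, a fixed-width dictionary string, is a hypothetical name; real code would also loop over chunks to respect the 10,000-element limit):

Code:

// dict: the dictionary as one string of fixed-width 9-character records.
// Build a list of lists of character codes, one inner list per word.
LOCAL n:=SIZE(dict)/9;
LOCAL codes:=MAKELIST(ASC(MID(dict,9*(I-1)+1,9)),I,1,n);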

My intuition is that we'd want to get the letters packed into integers in the fewest PPL operations since, once packed, one level of the work is parallelised intrinsically (at the C++ level).

A single PPL subtraction (of 39) applied to the list of lists could shift all the characters ("B" going to 27, "D" to 29, ...). The next step would be to shift the accented letters closer to the unaccented ones so that integers may be assembled. One way to do this would be something like "decomposed+(decomposed>51)*-153", where decomposed is the list of lists: the >51 masks for the accented characters, the multiplication slots in the translation to apply to them, and the + applies the translations -- 0 for encoded A...Z, -153 for encoded accented characters. (This would likely be faster with integers than with floating-point.)

(The above overlooks "-"...)
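Spelled out as a sketch (which likewise overlooks "-", and assumes list arithmetic and elementwise comparisons extend to lists of lists):

Code:

// codes: the list of lists of code points from the previous step.
LOCAL decomposed:=codes-39;      // 'A'..'Z' (65..90) -> 26..51
// (decomposed>51) is 1 exactly where a code is an accented letter,
// so the addition below shifts only those entries:
decomposed:=decomposed+(decomposed>51)*(-153);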

The next step would be to assemble the integers (with this 6-bit "in between" encoding) so that the shifting and masking can be done in larger chunks (at the C++ level). If the lists can be rearranged so that one list holds the first character of each word, another the second character, etc. (as I wrote earlier, additional PPL list / string slicing operations might be quite useful for this sort of coding; introducing a bit of such support may be necessary to let lists shine here), then a single BITSL can shift all of the encoded first letters, another BITSL all of the encoded second letters, and so on, as sketched below.
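A sketch of that packing step (col1, col2, ... are hypothetical flat lists holding the encoded first, second, ... letters of every word; BITSL is assumed to broadcast over lists, with the Home integer word size set to 64 bits):

Code:

// Pack ten 6-bit letters per word into 64-bit integers, list-wide.
LOCAL packed:=BITSL(col1,54);    // first letters into bits 54..59
packed:=packed+BITSL(col2,48);   // second letters into bits 48..53
// ...and so on, down to the tenth letters in bits 0..5.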

Once 10 characters per word are packed into 64-bit integers, masks and translations can reorder the encoding in bulk.
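To make the payoff concrete (same hypothetical names and broadcasting assumptions as above), an accent-insensitive probe of one letter position across the whole dictionary then takes only a couple of list-wide operations:

Code:

// Which words begin with an E-variant? The first letter sits in
// bits 54..59, and the E block (codes 0..5) has its top three bits clear:
LOCAL probe:=BITAND(packed,BITSL(#111000b,54));
LOCAL isE:=(probe==0);           // list of 0/1 flags, one per word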

I started out writing this with the intention to clarify my earlier comments, but after writing all of the above... hmmmm... :-&