1. 程式人生 > >How to write a cell address encoder in ruby.

How to write a cell address encoder in ruby.

How to write a cell address encoder in ruby.

I know that (unless you make a living building spreadsheets) you will probably never had to write an encoder like this one, but I think the techniques I’ll show you here are solid and may help you to solve other use cases as well.

In a future post I’ll cover why did I have to build such encoder, but for now, let’s say it could be handy in those cases where the amount of elements you have to address is fixed and the in memory storage you have to use sucks at handling strings.

So, now that we know why we may want to build an encoder like this, let’s build one.

The public API of our encoder will have 2 operations:

  • Encode. Takes a cell address and produces an integer.
  • Decode. Takes an integer and extracts the encoded address.

This is the way our encoder would be used:

encoder = encoder.newenc     = encoder.encode("A1") # -> numnum     = encoder.decode num   # -> "A1"

So far so good, but now we have to figure out a way to store a string like “A1” into a integer variable? (And how decode that integer back to string).

If you are thinking, “low level tricks”….

You are absolutely right! That’s exactly what we are going to do.

Let’s start by looking at the bit representation of the number 65535

, which is the maximum row number that our encoder would support.

Assuming 32 bits integers, the layout in memory looks something like this:

|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|

Unsurprisingly, it’s just a vector of 32 bits. Nothing interesting at all. What is interesting though, is the fact that to represent the maximum row number our encoder has to encode, only the lower (rightmost) 16 bits are required. That leaves us the uppers 16 bits (*) untouched. If we could find a way to read and write to those “wasted” bits, we could use them to store the column segment of a cell address. (Spoiler alert: Yes, there is a way. We can do that by using bit masks and bit operations.)

Now, let’s take a look at column names to see how the can fit in the wasted space. Assuming ASCII code points, we only need 7 bits to encode a char in the range A..Z (65..90). Since 7 is not a power of 2, and powers of 2 are easy to reason about (unless that holds true for me), let’s round that up to 8 bits per char. So, 16 spare bits, divided by 8 bits, equals 2. We can represent any column name in the range [A..AZ].

So, on 32 bits integers, we can store column names up to 2 characters long and row numbers is the range of (1..65535). This allows us to encode any cell address in the range [A1..AZ65535]. For those of you who are old enough… Yes; like the Excel from old times!

|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|^               ^               ^                               ^|   char 1      |     char 2    |          row number           |+-------------------------------+-------------------------------+|              16 bits          |          16 bits              |+------- column segment --------+-------- row segment ----------+

(*) Before going on, let’s shed some light on the jargon used before. Some times you will see the term: “low order bits”, or “less significant bits”, or “lower bits” or something like that; All of those terms refers to the same location, the rightmost “N” bits within the boundaries of a bits vector. In the other hand, when you see “high order bits”, or “most significant bits”, or “upper bits”, the reference is to the leftmost “N” bits within the boundaries of that vector.

Now that we know where to store the column name segment, let’s see how we can do it.

To get access to a specific location inside a bits vector, we can use a bit mask. A bit mask is a way to represent a subset of bits within a bits vector by turning on the specifics bits we are interested in. So let’s say we want to get access the segment where the first letter of column name will be stored, we can do that by using this bit mask.

|1|1|1|1|1|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|

As you can see, only the 8 leftmost bits are on. When applying this mask to a read/write operation, only the upper 8 bits are going to be affected.

While bitmaps allow us to access a particular segment, bit shift operators, bitwise and (&), and bitwise or (|); allow us to read and write to those segments.

In the code below you can see how those operations, combined with bit masks, allow us to encode/decode cell addresses.

# Bit masksMASKA = 0XFFFF               # <-  8.MASKB = 0XFFFFFFFF           # <- 16.MASKR = 0XFFFFFFFFFFFFFFFF   # <- 32.
# Returns an encoded value that represents a# cell address or -1 if arguments are invalid.# Valid column values are [a..az] and valid# row values are [1..65535].def encode addr    ok, col, row = parse(addr)    return -1 unless ok
    # gets the first and second "char" of the column.       a, b = split(col)    encoded = 0    encoded |= (a   <<  8)    encoded |= (b   << 16)    encoded |= (row << 32)
    encodedend
# Decodes and encoded cell address.# (See encode method.)def decode encoded    a = (encoded & MASKA) >>  8    b = (encoded & MASKB) >> 16    r = (encoded & MASKR) >> 32
    return "#{a.chr}#{r}" if b.chr == '_'    return "#{a.chr}#{b.chr}#{r}"end

To keep the code short and focus on encoding, I removed support methods like: parse, split, and friends, but you can take a look at them this gist.

And…, in case you may wonder… Yes, you can handle wider ranges by switching to 64 bit integers. (In fact, the “real” encoder that inspired this blog post, was written in C for 64 bits platforms. So yes, you can handle way larger ranges.)

Next time I’ll show you how to implement a fast in memory key/value storage using this encoder as key generator.

Thanks for reading! And have fun doing log level programming on ruby!