Part 3: Fetch it!
25 Jun 2018

All posts from the series can be found here
All the functionality mentioned before was quite complete, and I used it constantly for a year or even more. There were two drawbacks, however:
- I had source code only for the posts I wrote using the client. Nothing from earlier days was preserved, and if livejournal.com ever went down forever, I would lose them all.
- All the posts had to be written and later updated with the client, because it had no way to know about changes made elsewhere.
Both parts were not important for day-to-day writing, but the former kept resurfacing after every piece of negative news about the service, and the latter was more of a challenging task that I wanted to tackle. I resisted for a long time but finally gave up and decided to write the logic. Besides, how could it be a real git-like client if it had a push command but no fetch or merge?
For the first problem, fetch was the most important bit. I assumed that if I downloaded the posts in any form or shape, I could later turn them into markdown without any hassle.
Luckily, the original Livejournal authors anticipated such needs and designed the API protocol to enable it (kudos to Brad Fitzpatrick!).
The first important bit is syncitems. It allows you to get updates since some timestamp, or from the beginning. A change can be a new post, an update to a post, or a comment, and since any post can be updated several times, duplicates are expected. The only thing the protocol did not support was post deletions, which meant that I wouldn't notice such a change.
I didn't care about comments or deleted posts, however; what I wanted to get was the list of posts that had changed in some way since a timestamp. To make things even more complicated, the API call didn't return all changes at once but just some subset, so I had to make several calls in a row to get the desired result. The basic shape of the loop is sketched below.
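Something like the following, where lj-syncitems is a hypothetical wrapper around the raw syncitems call, and the plist shape of its response is my assumption (the real protocol returns syncitems, count, and total fields):

(defun collect-changed-items (last-ts)
  "Page through syncitems until the server says we have everything.
Duplicates are expected, so keep only the latest change time per id."
  (let ((seen (make-hash-table :test 'equal))
        (ts last-ts))
    (loop
      (let ((resp (lj-syncitems ts)))
        (dolist (item (getf resp :syncitems))
          (setf (gethash (getf item :item) seen) (getf item :time))
          (setf ts (getf item :time)))
        ;; :count items returned out of :total pending changes
        (when (>= (getf resp :count) (getf resp :total))
          (return seen))))))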
The full logic ended up quite complicated, and I literally wrote down the steps the algorithm should take, then slowly implemented them one after another, surrounding them with tests to keep everything under control. You can check the final logic in the get-unfetched-item-ids function in lj-api. As you can see, I used the -<> threading macro all over the place to keep the code readable. Apart from the logic itself, I ended up implementing some helper functions that I didn't find in the language but that were very useful.
The first example of this is acc. I do a lot of Perl, and data traversal is really good there. The language allows you to easily access nested structures without checking the existence of anything: $obj->{a}{b}{c}. Common Lisp's getf is much less flexible and allows only one level of lookup, and this is where acc is useful, since it accepts any series of keys:
(defun acc (l &rest args)
  "Access member in a nested plist.
Usage (acc l :one :two :three)"
  (cond
    ((null args) l)
    ((not (listp l)) nil)
    (t (apply #'acc (getf l (car args)) (cdr args)))))
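For example:

(acc '(:a (:b (:c 42))) :a :b :c) ; => 42
(acc '(:a (:b 1)) :a :missing)    ; => NIL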
Another two functions are partial and print-and-return. The name of the first one is self-explanatory, and the second one accepts a value, prints it, and immediately returns it. This is useful in the middle of a threading macro call, because it allows printing intermediary steps of the computation. A sketch of both follows.
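Minimal versions of the two look roughly like this (a sketch, not necessarily the exact definitions I used):

(defun partial (f &rest args)
  "Return a function with ARGS already baked into F."
  (lambda (&rest more) (apply f (append args more))))

(defun print-and-return (x)
  "Print X and return it unchanged; handy inside -<> chains."
  (format t "~a~%" x)
  x)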
One last perlism is to-hash-table. In Perl, transformation between a list and a hashmap is extremely simple and happens all the time. This is how I would do it:
my %ht = map { $_->{id} => $_ } @list;
This simplicity means that most Perl code consists of jumps between lists and hashes, using whatever works best in a particular situation. Posts are stored in the database as a list, hence I had to iterate over it in a more or less smart way all the time. A list is a reasonable structure there, and one way to fix the problem with lookups would have been to maintain a hashtable in the class, but I decided to go with a simpler option and convert the list to a table on the fly.
(defun to-hash-table (l)
  (let ((ht (make-hash-table :test 'equal)))
    (dolist (item l ht)
      (setf (gethash (car item) ht) (cadr item)))))
This snippet highlights one of the neat features I like so much in Common Lisp. It may not have the best standard library in the world, but there are true gems of usability in it. In this particular case, dolist accepts a third parameter, which is returned as the result of the whole dolist form. That means you can write things like filling in a hash table very naturally.
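For instance, summing a list without any explicit return juggling:

(let ((sum 0))
  (dolist (x '(1 2 3) sum) ; SUM is the result form
    (incf sum x)))         ; => 6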
After the get-unfetched-item-ids function was done, the next one on the list was the one to download a list of posts. And here it got a bit tricky because of Unicode.
Unicode and Livejournal
Livejournal has been around for quite a long time, which means its codebase and data predate the era when everything became Unicode. Many veteran coders can still show scars from that time. In essence, there was the ASCII table that defined the meaning of a byte of data (hence single-byte encodings): it specified the first 128 values (0 to 127) and left the rest undefined, and that space was used for extra characters by every specific alphabet. In Russia, the two most popular encodings were cp1251 and koi8-r.

How did this affect Livejournal? As written in their FAQ, they had no way of knowing the encoding, and hence it was left up to users to choose the proper one to render a page.
If such an encoding was chosen, the Livejournal API allowed downloading both Unicode and non-Unicode posts in the same way. Unfortunately, as I found empirically, the server-side conversion did some weird stuff, presumably because the text had already been converted to Unicode somewhere along the way, and I had to disable it in order to get properly readable text.
This decision had consequences: now I couldn't download different types of posts in one batch, because the Livejournal API returned an error in that case. The beast I ended up writing is called lj-getevents-multimode; you can check it out in lj-api. I baked a couple of assumptions into the code:
- One of the API versions (the Unicode one) was much more probable.
- If a post had one version, the next post had a high chance of having the same one.
The final logic looked like this (a rough sketch in code follows the list):
- Try to download a batch of posts in one version.
- If that fails, reduce the batch to one post and repeat.
- If that fails, flip the download mode and repeat.
- If that fails, error out.
- If one of the previous steps succeeded, double the batch size and repeat.
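In code, the strategy might look something like this sketch; lj-getevents here is a hypothetical stand-in for the real API call, assumed to signal an error on an encoding-mode mismatch (the actual implementation is lj-getevents-multimode):

(defun try-getevents (batch mode)
  "Attempt one download; return the events or :FAIL."
  (handler-case (lj-getevents batch mode)
    (error () :fail)))

(defun fetch-all-multimode (itemids)
  (let ((mode :unicode) ; assumption: the Unicode version is more likely
        (size 1)
        (events '()))
    (loop while itemids
          do (let* ((batch (subseq itemids 0 (min size (length itemids))))
                    (got (try-getevents batch mode)))
               (when (eq got :fail)        ; shrink the batch to one post
                 (setf batch (list (first itemids))
                       got (try-getevents batch mode)))
               (when (eq got :fail)        ; flip the download mode
                 (setf mode (if (eq mode :unicode) :non-unicode :unicode)
                       got (try-getevents batch mode)))
               (when (eq got :fail)        ; both modes failed: give up
                 (error "Cannot fetch item ~a in either mode" (first itemids)))
               (setf events (append events got)
                     itemids (nthcdr (length batch) itemids)
                     size (* 2 (length batch))))) ; success: double the batch
    events))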
Maybe a bit of a naive approach, but it worked really well in the end. After these two bits were done, the fetch logic started looking very simple:
(defmethod fetch-posts ((store <store>))
"Fetch all new items from remote service since last-fetched-ts
of the store"
(multiple-value-bind
(new-itemids last-item-ts ht) (get-unfetched-item-ids store)
(cond
((null new-itemids) store)
(t (let ((new-events (-<> new-itemids
(lj-getevents-multimode)
(getf <> :events)
(mapcar #'(lambda (x) (enrich-with-ts x ht)) <>))))
(merge-events store new-events last-item-ts)
(fetch-posts store))))))
<store> here is another class I created, to keep a list of downloaded posts in their original form. The enrich-with-ts function exploits the fact that get-unfetched-item-ids knows the server-side change date of every updated post while the download API call does not return it; it simply adds such a timestamp to every post. merge-events does no more than place downloaded posts at the end of the list.
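For illustration, enrich-with-ts can be as small as this sketch (the key names here are my assumption):

(defun enrich-with-ts (event ht)
  "Attach the server change date from HT to the event plist."
  (append event (list :sync-ts (gethash (getf event :itemid) ht))))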
Now, where should I keep this store? I mimicked the way I stored and saved the database with posts and added a hacky solution to lazy-load it. I don't think it's strictly necessary, since posts are downloaded differently, but in case you wonder, here is the implementation:
(defmethod slot-unbound (class (db <db>) (slot-name (eql 'fetch-store)))
(setf (slot-value db 'fetch-store)
(make-instance '<store>)))
This method gets triggered whenever the slot doesn't have a value set; it creates the class instance and stores it in the slot. The next time around, the slot already has a value and the method is not called anymore.
The top-level fetch function now looked very simple:
(defun fetch-updated-posts ()
(let ((store (restore-source-posts (fetch-store *posts*))))
(fetch-posts *posts*)
(save-source-posts store)))
After I got all this working, I had a raw dump of every post I had ever written, which meant that I could safely work on converting them back to markdown without fear of losing the contents.
Testing
To be frank, the sync protocol didn't come for free to me: too many moving parts and conditions. And while most of the codebase had been written in a purely leisurely fashion without a single test, I decided to build new features, starting with fetch, with at least some coverage. I chose prove as my framework.
The most annoying bit of this framework is its default reporter, which uses escape control sequences for colors; emacs requires some additional configuration to make that work, and that's not something I wanted to invest my time in. Instead, I invested it in finding a way to disable them. It turned out I had always been one dynamic variable away from the result:
(setf prove:*enable-colors* nil)
The overall test integration could have been easier, but it was still doable. I made a test system, added a magical spell to the main asd file, and the test framework was set up. One important thing to note is that prove itself is a dependency of the test system, so in order to have it available in the REPL, this test subsystem should be loaded instead of the main one, which will be pulled in as a dependency.
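For reference, the spell in question looks roughly like this; the system names are hypothetical, but the prove-asdf plumbing is the documented way to wire prove into asdf:

(defsystem "lj-client"
  ;; ... the main system ...
  :in-order-to ((test-op (test-op "lj-client-test"))))

(defsystem "lj-client-test"
  :depends-on ("lj-client" "prove")
  :defsystem-depends-on ("prove-asdf")
  :components ((:test-file "t/lj-client"))
  :perform (test-op :after (op c)
                    (uiop:symbol-call :prove-asdf :run-test-system c)))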
A really awesome feature of the prove framework is that it allows rerunning specific tests just by recompiling them. This enables near-magical workflows where I could prototype a feature and then cover it with tests in real time, without running the full test suite again and again or doing a build every time and running the tests there.
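A test here is just a form you can recompile and rerun from the REPL; for example, using acc from above:

(prove:plan 1)

(prove:subtest "acc walks nested plists"
  (prove:is (acc '(:a (:b 1)) :a :b) 1)
  (prove:is (acc '(:a (:b 1)) :a :missing) nil))

(prove:finalize)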
Since I wanted to write tests for the logic built around API calls, I needed to mock them to test the outcomes of specific sequences of calls. I took mockingbird, and it did provide the basic feature of mocking any function in any package; however, I ended up implementing a small macro to enable testing a sequence of calls.
Funnily enough, all the mocked calls ended up being trivial, but the possibility is still there! The idea was that if you wanted to mock a function, say foo, and have it return (1 2 3) on the first call and nil on any subsequent one, you could just write:
(with-mocked-calls foo
    ((1 2 3)
     nil)
  (some)
  (code))
The macro generates a lambda function for mockingbird with baked-in logic that checks the number of calls made so far and returns the respective result. Here it is:
(defmacro with-mocked-calls (func data &body body)
  "Emulate the behavior of a function with side effects:
every subsequent call returns the next item from the
DATA list, except the last one, which is returned
endlessly."
`(with-dynamic-stubs
((,func
(lambda (&rest rest)
(declare (ignore rest))
(cond
,@(loop for resp in data
for i from 1 to (length data)
collect
(if (equal i (length data))
`(t (quote ,resp))
`((equal (call-times-for (quote ,func)) ,i)
(quote ,resp))))))))
,@body))
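To make the mechanics clearer, for the foo example above the macro expands to roughly this:

(with-dynamic-stubs
    ((foo (lambda (&rest rest)
            (declare (ignore rest))
            (cond
              ((equal (call-times-for 'foo) 1) '(1 2 3))
              (t nil)))))
  (some)
  (code))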
I'm no macro guru, and I get super excited about every single case where I manage to write something that actually looks like a useful thing. You can check the tests for this definition and its real-world usage.
Now, let’s merge.