fileUploader.js: Resumable HTML5 Uploads

fileUploader.js

fileUploader.js is a JavaScript library designed to interact with nginx's upload module, allowing for resumable uploads. The library works as follows:

var uploader = fileUploader(file, segmentSize, sessionId)

where file is a File object from the HTML5 File API, segmentSize is the size in bytes of each segment of an upload, and sessionId is the session ID used to represent this file upload (must be unique!). The methods of the uploader object return Bluebird promises (Promises/A+ compliant), which can be consumed using Bluebird's API. The uploader offers the following methods:

  • uploader.fetchStatus(): returns a promise for a status object, which has the following structure:

      {
        completed : <true|false>,
        start : <the first byte uploaded; should always be 0>,
        end : <the last byte uploaded>,
        total : <the total size of the file in bytes>
      }
    

    completed should be true if and only if end === total - 1.

  • uploader.uploadSegments(status, onSegmentComplete): given a status object, uploads the remaining segments of the file. onSegmentComplete(newStatus) is called after each segment is uploaded, with a status argument reflecting the new state. It returns a promise for the completed status.

Thus, to upload a file, one may do:

  uploader.fetchStatus()
      .then(function(status) {
          return uploader.uploadSegments(status, function(newStatus) {});
      })
      .then(function(status) { console.log("upload complete"); });

Design

Resumable uploads in the upload module work in the following way:

  1. Uploaded segments of a single file should share a session ID to identify them as part of the same file.

  2. When a segment of a file is uploaded, the server returns the upload status in the response body, e.g. 0-5,9-15/24, indicating that the file is 24 bytes long and that bytes 0-5 and 9-15 have been uploaded. Partial uploads receive a 201 response.

  3. Segments may be re-uploaded, as long as the re-uploads do not happen in parallel and the re-uploaded segment contains the same data.

The fileUploader library attempts a small upload (the first 2 bytes of the file; it turns out the upload module has a bug in which it does not handle repeated uploads of just the first byte of a file) to get the status of an upload. From that starting status, the uploader incrementally uploads the file in 1 MB segments.
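
For concreteness, the status line is easy to parse. Here is a rough sketch in Scala, purely for illustration (fileUploader itself is JavaScript, and the Status class and parseStatus function below are not part of the library):

    // Parses an upload-module status line such as "0-5,9-15/24" into the byte
    // ranges received so far and the total file size.
    case class Status(ranges: Seq[(Long, Long)], total: Long) {
      // The upload is complete when a single range covers bytes 0 .. total - 1.
      def completed: Boolean = ranges == Seq((0L, total - 1))
    }

    def parseStatus(line: String): Status = {
      val Array(rangePart, totalPart) = line.split("/")
      val ranges = rangePart.split(",").toSeq.map { r =>
        val Array(start, end) = r.split("-")
        (start.toLong, end.toLong)
      }
      Status(ranges, totalPart.toLong)
    }

    parseStatus("0-5,9-15/24")  // Status(Seq((0,5), (9,15)), 24), not yet complete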

The Role of Testing in Design

I'm a huge proponent of unit testing. At my current workplace, I try to push for a greater emphasis on testing, because well-tested code with well-written tests [1] is a pleasure to work with. I also feel that test-driven development (TDD) is a valuable tool in the right place.

When is TDD useful? It's great for adding functionality, especially business functionality, to existing code:

  1. Write a test that clearly documents the intended business function.
  2. Implement the business function.

Writing the test first often clarifies what the desired business behaviour should be, and the test clearly documents and verifies that desired behaviour.

However, TDD does not belong everywhere. TDD can be a barrier when designing an interface or API. To expand on this, let's agree on something first: well-written tests test functionality and interfaces; they do not test implementation. It should be possible to modify an implementation with minimal or no changes to a test suite.

Design is an iterative process, and the first design is rarely the best one. Writing tests against an interface that is still being designed makes that interface more costly to change, and the earlier the tests are written, the less likely it is that the interface will be iterated on. This results in suboptimal designs.

A simple example illustrates the problem. Suppose you imagine a Worker class that you know consumes objects, so you create an interface: consume(Object toConsume). In a TDD methodology, one would write tests defining how the consume function works. However, this is not necessarily the right API. It may turn out that your Worker more naturally consumes sequences of objects instead: consume(Seq[Object] toConsume). At best, all your tests must be refactored to accommodate this. Worse, you might not remove the first, incorrect consume method at all, instead leaving unnecessary complication in your API (and in your test suite!). Writing tests too early has increased the amount of work and may even result in a weaker API.
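
To make the two candidate interfaces concrete, here is a rough Scala sketch; the Worker and SequenceWorker traits are hypothetical, purely for illustration:

    // First guess at the interface: the Worker consumes one object at a time.
    trait Worker {
      def consume(toConsume: Object): Unit
    }

    // The interface that turns out to fit better: it consumes sequences of objects.
    // Any tests already written against the single-object signature must now be
    // rewritten, or the old method lingers alongside the new one.
    trait SequenceWorker {
      def consume(toConsume: Seq[Object]): Unit
    }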

A Better Way

What is a better workflow? In my experience:

  1. Prototype a design until you understand it well and it has the properties you desire.
  2. Write tests and flesh out implementation.

This is more efficient and tends to result in better-thought-out interfaces. Note, however, that it's still important to keep testing in mind when designing; an untestable design is not a good one. Once you are satisfied with the design, writing tests against it should prove that it is testable.

[1] On the other hand, well-tested code with poorly written tests is often resistant to refactoring.

Stieltjes: a minimal Scala client for Riemann

Stieltjes is a minimal, UDP-only, Netty-based Scala client for Riemann, the events and monitoring system for distributed systems.

    import stieltjes._

    val client = new UdpRiemannClient("127.0.0.1")
    client.write(
      Event(Host("myhost"), Service("server"), State("ok"), Metric(3.0F)))

By default, the client uses client port 5556 and server port 5555.

Events are immutable, but can be used as templates for other events:

    val defaultEvent = Event(Host("myhost"), Service("server"), State("ok"), Ttl(20))
    ...
    client.write(defaultEvent(State("critical"), Metric(1000L), Description("critical error")))

Caveats

Because Stieltjes uses UDP, event delivery is not guaranteed. It is therefore better suited to high-volume stats tracking than to error reporting.

Because events are sent as fire-and-forget UDP packets, there is no mechanism for detecting if the Riemann server is down.

A Tour of Immutable Data Structures

An immutable data structure is one that cannot be modified after it is created. This seems to imply that any changes to the data structure require the entire object to be copied into the new, modified object. However, that's not the case. It's easiest to see why with a simple example:

Linked Lists

Consider an immutable singly-linked linked list:

d -> c -> b -> a

Let's call this list L. L can be stored with just a reference to the head of the list, the element d. Suppose we want to add an element ("e") to the head of the list, creating the new list L':

e -> d -> c -> b -> a

We now have two lists: L' (starting at the element e) and L (starting at the element d). We did not have to copy L to a new list to create L'.

Now, suppose we want to add an element to the end of L. If we add it directly:

d -> c -> b -> a -> x

Then we will have modified L, because anyone accessing L (starting at d) now has an extra element on the end of their list. To add an element to the end of L, we need to fully copy it:

d -> c -> b -> a
d -> c -> b -> a -> x

What's the difference between adding to the head of the list and adding to the tail? No element in L has a pointer to the new head element e, but every element in L has a path of pointers to the new tail element x.
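
A minimal sketch of such a list in Scala makes the sharing explicit (the MyList, Cons, and Empty names are mine, for illustration; Scala's built-in List behaves essentially the same way):

    // A minimal immutable singly-linked list.
    sealed trait MyList[+A] {
      // Prepending shares the entire existing list: O(1) time and space.
      def prepend[B >: A](x: B): MyList[B] = Cons(x, this)

      // Appending must copy every node, because every node can reach the tail: O(n).
      def append[B >: A](x: B): MyList[B] = this match {
        case Empty      => Cons(x, Empty)
        case Cons(h, t) => Cons(h, t.append(x))
      }
    }
    case object Empty extends MyList[Nothing]
    case class Cons[+A](head: A, tail: MyList[A]) extends MyList[A]

    val l      = Cons("d", Cons("c", Cons("b", Cons("a", Empty))))  // L
    val lPrime = l.prepend("e")  // L': e -> d -> c -> b -> a, sharing all of L
    val lX     = l.append("x")   // d -> c -> b -> a -> x, a full copy of L plus x

l, lPrime, and lX can all be used independently; nothing in lPrime or lX ever mutates l.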

General Immutable Data Structures

We can generalize this idea into a rule for immutable data structures in general, as follows. Given two elements x and y in a data structure, say that y is reachable from x if there is a directed path of pointers from x to y. In the linked list L above, a is reachable from d, but c is not reachable from b. The rule is:

When modifying an immutable data structure at a given position, we must copy all elements from which that position is reachable.

In our linked list example, a modification to b requires d and c to be copied--but not a.

This idea turns out to be quite powerful. It implies that the immutable data structures that take the least time and space to modify are those containing only short paths. In a binary tree, modifying (that is, inserting, deleting, or changing the value of) a node requires that only the ancestors of that node be copied to the new structure. This implies that creating a new binary tree with a modified element takes O(log n) time and space (assuming the original tree is balanced). Trees with higher fan-out are even better: a tree with a fan-out of 64 and a height of 10 can contain over 10^17 elements, but modifying such a tree to create a new one takes only O(1) time and space. In fact, Scala's immutable Vector uses a structure like this.
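
Here is what that path copying looks like, sketched as an immutable binary search tree in Scala (illustrative only; Vector itself uses a wide trie rather than a binary tree):

    // An immutable binary search tree over Ints. Inserting a value copies only
    // the nodes on the path from the root down to the insertion point; every
    // other subtree is shared between the old tree and the new one.
    sealed trait Tree
    case object Leaf extends Tree
    case class Node(value: Int, left: Tree, right: Tree) extends Tree

    def insert(tree: Tree, x: Int): Tree = tree match {
      case Leaf                   => Node(x, Leaf, Leaf)
      case Node(v, l, r) if x < v => Node(v, insert(l, x), r)  // copy this node, share r
      case Node(v, l, r) if x > v => Node(v, l, insert(r, x))  // copy this node, share l
      case node                   => node                      // x is already present
    }

In a balanced tree, only the O(log n) ancestors of the inserted node are rebuilt; every other subtree is shared between the old and new trees.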

Why You Should Use Immutable Data Structures

  • They're inherently thread-safe.
  • You can always copy an immutable object "by reference", because it will never change on you.
  • Code using immutable objects is easier to read. After an immutable object is assigned to a value, it will never change.
  • As discussed above, performance is often comparable to that of mutable data structures.

Ruby Array Performance

The above chart summarizes the amortized performance of various Ruby array operations. It shows that add (array concatenation with '+'), '<<' (append), and unshift are linear time, while pop and shift are O(1). It's interesting to note that, like pop, shift is implemented in a way that doesn't require the array to be copied. This might lead one to expect that shifting and then unshifting would be cheaper than just unshifting (because by shifting first, we make room for the unshifted element, so no array copy is needed), but this isn't the case in Ruby 1.9.2.

'<<' (append) is much cheaper than array concatenation using '+', but is still linear time when operating on a newly initialized large array, because at least one array copy is required to allocate more space.

However, the amortized time of appending N elements to an empty array is O(1).

Here's the code I used for this:

# Yields the given block `samples` times against an Array of each size in
# `sizes`, and returns the average time per call in milliseconds.
def benchmark(samples, sizes)
  results = []
  for n in sizes
    array = Array.new(n, 1)
    start = Time.now
    samples.times { yield array }
    results << (Time.now - start) * 1000 / samples
  end
  results
end

samples = 100
sizes = [10_000, 50_000, 100_000, 250_000, 500_000, 750_000, 1_000_000, 1_250_000, 1_500_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]

results = {}
results['<<'] = benchmark(samples, sizes) {|array| array << 1}
#results['shift'] = benchmark(samples,sizes) {|array| array.shift}
results['unshift'] = benchmark(samples,sizes) {|array| array.unshift(1)}
#results['shift then unshift'] = benchmark(samples,sizes) do |array|
#  array.shift
#  array.unshift(1)
#end
#results['add'] = benchmark(samples,sizes) {|array| array + [1]}
#results['pop'] = benchmark(samples,sizes) {|array| array.pop}

data = []
for key in results.keys
  data << sizes
  data << results[key]
end

require 'googlecharts'
puts Gchart.line_xy( :theme => :thirty7signals,
                      :title => 'Ruby Array Operation Speed',
                      :data => data,
                      :axis_with_labels => 'x',
                      :legend => results.keys,
                      :encoding => 'text',
                      :max_value => 'false'
                     ).sub!('chds=,', 'chds=a')