Скачать книгу

stacks, both of them sorted, you can easily collate them into an ordered stack of four. Repeating this trick a few times, you’d build bigger and bigger stacks, each one of them already sorted. Soon enough, you could collate yourself a perfectly sorted full deck—with a final climactic merge, like a riffle shuffle’s order-creating twin, producing the desired result.

      This approach is known today as Mergesort, one of the legendary algorithms in computer science. As a 1997 paper put it, “Mergesort is as important in the history of sorting as sorting in the history of computing.”

      The power of Mergesort comes from the fact that it indeed ends up with a complexity between linear and quadratic time—specifically, O(n log n), known as “linearithmic” time. Each pass through the cards doubles the size of the sorted stacks, so to completely sort n cards you’ll need to make as many passes as it takes for the number 2, multiplied by itself, to equal n: the base-two logarithm, in other words. You can sort up to four cards in two collation passes, up to eight cards with a third pass, and up to sixteen cards with a fourth. Mergesort’s divide-and-conquer approach inspired a host of other linearithmic sorting algorithms that quickly followed on its heels. And to say that linearithmic complexity is an improvement on quadratic complexity is a titanic understatement. In the case of sorting, say, a census-level number of items, it’s the difference between making twenty-nine passes through your data set … and three hundred million. No wonder it’s the method of choice for large-scale industrial sorting problems.

      Mergesort also has real applications in small-scale domestic sorting problems. Part of the reason why it’s so widely used is that it can easily be parallelized. If you’re still strategizing about that bookshelf, the Mergesort solution would be to order a pizza and invite over a few friends. Divide the books evenly, and have each person sort their own stack. Then pair people up and have them merge their stacks. Repeat this process until there are just two stacks left, and merge them one last time onto the shelf. Just try to avoid getting pizza stains on the books.

      Beyond Comparison: Outsmarting the Logarithm

      In an inconspicuous industrial park near the town of Preston, Washington, tucked behind one nondescript gray entrance of many, lies the 2011 and 2013 National Library Sorting Champion. A long, segmented conveyor belt moves 167 books a minute—85,000 a day—through a bar code scanner, where they are automatically diverted into bomb-bay doors that drop into one of 96 bins.

image

      A Mergesort in action. Given a shelf of eight unsorted books, start by putting adjacent books into sorted pairs. Then collate the pairs into ordered sets of four, and finally collate those sets to get a fully sorted shelf.

      The Preston Sort Center is one of the biggest and most efficient book-sorting facilities in the world. It’s run by the King County Library System, which has begun a healthy rivalry with the similarly equipped New York Public Library, with the title going back and forth over four closely contested years. “King County Library beating us this year?” said the NYPL’s deputy director of BookOps, Salvatore Magaddino, before the 2014 showdown. “Fuhgeddaboutit.”

      There’s something particularly impressive about the Preston Sort Center from a theoretician’s point of view, too. The books going through its system are sorted in O(n)—linear time.

      In an important sense, the O(n log n) linearithmic time offered by Mergesort is truly the best we can hope to achieve. It’s been proven that if we want to fully sort n items via a series of head-to-head comparisons, there’s just no way to compare them any fewer than O(n log n) times. It’s a fundamental law of the universe, and there are no two ways around it.

      But this doesn’t, strictly speaking, close the book on sorting. Because sometimes you don’t need a fully ordered set—and sometimes sorting can be done without any item-to-item comparisons at all. These two principles, taken together, allow for rough practical sorts in faster than linearithmic time. This is beautifully demonstrated by an algorithm known as Bucket Sort—of which the Preston Sort Center is a perfect illustration.

      In Bucket Sort, items are grouped together into a number of sorted categories, with no regard for finer, intracategory sorting; that can come later. (In computer science the term “bucket” simply refers to a chunk of unsorted data, but some of the most powerful real-world uses of Bucket Sort, as at the KCLS, take the name entirely literally.) Here’s the kicker: if you want to group n items into m buckets, the grouping can be done in O(nm) time—that is, the time is simply proportional to the number of items times the number of buckets. And as long as the number of buckets is relatively small compared to the number of items, Big-O notation will round that to O(n), or linear time.

      The key to actually breaking the linearithmic barrier is knowing the distribution from which the items you’re sorting are drawn. Poorly chosen buckets will leave you little better than when you started; if all the books end up in the same bin, for instance, you haven’t made any progress at all. Well-chosen buckets, however, will divide your items into roughly equal-sized groups, which—given sorting’s fundamental “scale hurts” nature—is a huge step toward a complete sort. At the Preston Sort Center, whose job is to sort books by their destination branch, rather than alphabetically, the choice of buckets is driven by circulation statistics. Some branches have a greater circulation volume than others, so they may have two bins allocated to them, or even three.

      A similar knowledge of the material is useful to human sorters too. To see sorting experts in action, we took a field trip to UC Berkeley’s Doe and Moffitt Libraries, where there are no less than fifty-two miles of bookshelves to keep in order—and it’s all done by hand. Books returned to the library are first placed in a behind-the-scenes area, allocated to shelves designated by Library of Congress call numbers. For example, one set of shelves there contains a jumble of all the recently returned books with call numbers PS3000–PS9999. Then student assistants load those books onto carts, putting up to 150 books in proper order so they can be returned to the library shelves. The students get some basic training in sorting, but develop their own strategies over time. After a bit of experience, they can sort a full cart of 150 books in less than 40 minutes. And a big part of that experience involves knowing what to expect.

      Berkeley undergraduate Jordan Ho, a chemistry major and star sorter, talked us through his process as he went through an impressive pile of books on the PS3000–PS9999 shelves:

      I know from experience that there’s a lot of 3500s, so I want to look for any books that are below 3500 and rough-sort those out. And once I do that, then I sort those more finely. After I sort the ones under 3500, I know 3500 itself is a big section—3500–3599—so I want to make that a section itself. If there are a lot of those I might want to fine-tune it even more: 3510s, 3520s, 3530s.

      Jordan aims to get a group of 25 or so books onto his cart before putting them in final order, which he does using an Insertion Sort. And his carefully developed strategy is exactly the right way to get there: a Bucket Sort, with his well-informed forecast of how many books he’ll have with various call numbers telling him what his buckets should be.

      Sort Is Prophylaxis for Search

      Knowing all these sorting algorithms should come in handy next time you decide to alphabetize your bookshelf. Like President Obama, you’ll know not to use Bubble Sort. Instead, a good strategy—ratified by human and machine librarians alike—is to Bucket Sort until you get down to small enough piles that Insertion Sort is reasonable, or to have a Mergesort pizza party.

      But if you actually asked a computer scientist to help implement this process, the first question they would ask is whether you should be sorting at all.

      Computer science, as undergraduates are taught, is all about tradeoffs. We’ve already seen this in the tensions between looking and leaping, between exploring and exploiting. And one of the most central tradeoffs is between sorting and searching. The basic principle is this: the effort expended on sorting materials is just a preemptive strike against the effort it’ll take to search through them later. What the precise balance should be depends on the exact parameters of the situation, but thinking about sorting as valuable only to support future search

Скачать книгу