Machine Gestalt

Programming, Machine Learning and Algorithms

So you want to hire good programmers?

This morning I read a post on readwriteweb about gamification of recruiting and companies hiring using one minute bug fixing tests.  You can view the post here if you're really interested.  This post really bothered me, because it shows that there are a lot of people that don't get it when it comes to attracting top notch talent.

I'm not going to say that gamification of recruiting and and the one minute tests are going to lead to bad hires.  The set of people who do well in those circumstances is not disjoint from the set of good coders, and it will weed out some people who really don't belong.  The problem is that they aren't selecting for the specific skillset that makes a really great developer.

Let me make it easy for you.  The type of developer you want views themselves as an artist.  They are in it for the long haul, and are constantly looking to improve their craft.  These people gravitate to interesting challenges, a chance to learn from other like minded people and creative autonomy.  You might get some people like this with contests, but mostly you are going to get bright, inexperienced college age kids who feel like they have something to prove.  To attract the artists, you need to show them something interesting.  Make an artist curious and he or she will find you.

This dovetails nicely with the best way to weed out artists from wannabes - code.  The GitHire people got it right; how you interact on a project is how you will perform on the job (minus in person social issues).  After providing something interesting to hack on, get people involved by offering bounties on bugs and features.  The money doesn't matter so much, but if you explain that you want to screen for new hires from the community around your project, it will show people that you are serious.  If someone looks promising, have them sign an NDA bring them in as a contractor until you're 100% certain they are a good fit.

I'm sure some people who read this will think to themselves "how do I create a community around my product while protecting my secret sauce?".  You people need to get over yourselves.  There are lots of big companies with hundreds of talented programmers at the ready who have probably already rejected your exact idea, "secret sauce" included.  The reason they aren't doing it is because it is a questionable prospect, and only big balls and sharp execution are going to make it work.  Big businesses are risk averse.  They will happily copy your business model and attempt out-market you, though.  Having strong community is a great bulwark against this.

Exploring the implications of being Pythonic

I like Python a lot; it has some nice properties, and there are a lot of well designed libraries for it.  That doesn't mean it doesn't frustrate me...

 

The word "pythonic" is thrown around a lot in the Python community.  Like most ideas, it looks great at first glance.  Unfortunately, it the Python community is often dogmatic about being "pythonic".  That is a great way to be part of a culture, but constrains your ability to address some problems.

 

This presentation I put together for a talk on the subject...

 

If programming languages were cars...

This is how they might look...

Please note that I've tried to capture the spirit of various languages, with some twists for entertainment value.  Don't take it personally if the car I selected for your language of choice isn't what you'd like, this is purely for entertainment :)

 

Fortran

Fortran

Sturdy and dependable, though not particularly maneurverable or sexy.

 

Lisp

Lisp
After all these years, no other car has vertical doors (which also kind of look like paretheses).

C
C

Still the best systems programming language 40 years later.

C++
Cpp
Because C wasn't tricked out enough.

Perl
Perl
Larry Wall was commanded by god to create this language... Rumor hasi t that the cryptic syntax contain clues about the life of Jesus after the ressurection.

Visual Basic
Visualbasic
Loads of fun to drive endlessly around the Microsoft app track.

Python
Python
A stylish ride, though not the fastest or most maneuverable.

Haskell
Haskell
How do I drive this thing?  Despite being older than many languages on the list, Haskell is more modern in most ways.

Lua
Lua
Cute, efficient, and becoming very trendy.

Java
Java
Not pretty or fun to drive, but solid.

Javascript
Javascript
Originally a go-kart much in the vein of Visual Basic, Javascript has become a reasonable language in its own right.  It still has a lot of warts and is marginally functional outside of its original domain.

PHP
Php
Initially popular because it was easy to install and let users embed code in web pages rather than write CGI applications.  Now it is the purple bat van of programming languages.  I'll leave the interpretation of that statement up to you.

Ruby

Ruby

Similar to Python in many ways, though arguably slightly more maneuverable and less elegant.

C#
Csharp
Based on Java and slightly more modern, but in some ways less solid. 

 

LuaJIT
Luajit

Though not a language in its own right, LuaJIT is an amazing technical accomplishment and that little car is just too cute not to share.

 

 

Clojure

Clojure

Lisp, tricked out for the JVM.   The Delorean monster truck is probably overkill, but you get the idea :) 

 

Why formal methods are the future of programming

After posting about design by contract and symbolic expressions in Python, I was curious about the current state of formal methods.  From what I can tell, not much has changed since the 80's, which is a real shame, because I feel formal methods have a lot to offer.

The driving use-case for formal methods in most instances is verification.  They provide this in spades, but usually with a significant cost in man hours.  Part of the reason for this is that most code is written in a highly imperative style with machine types that provide little for a theorem prover to latch on to.  People must go back and make declarative statements about the code and data (which are not always guaranteed to be correct).

Fans of statically typed languages might try to call this a win, but that is actually not the case.  Statically typed languages are only slightly better off than dynamically typed languages here, because generally language types only tell you that a certain set of properties is expected, with values of some other uninformative type.  As an example, if I wanted to prove that a function has an inverse, I need to prove that each operation has an inverse; there are a great many mathematical functions that do not have a general inverse, but have an inverse function for a subset of their domain.  If I know the input is constrained to an invertible domain, I can proceed.  Just knowing that the input is a float doesn't help me.

If formal methods were just about proving code so I can be lazy about tests, they wouldn't be exciting to me.  The reason I am really interested, is that formal methods can be used to generate code.  As an example, lets look at Prolog.  In Prolog, you make declarative logical statements, then perform queries, which are resolved by chaining first order logic statements, until a path from your starting point to your endpoint is found or the search space is exhausted.  Prolog frames this is terms of querying relations like (socrates, man), (man, mortal) therefore (socrates, mortal).  If we imagine this in terms of defining a socrates_mortal function rather than performing a query for (socrates, mortal),  it should be clear what I mean.  If every function in your language is treated like a declarative statement of relation between X (the inputs) and Y (the outputs), you could use a mathematical proof assistant (in this case, probably using higher-order logic) to produce code from some X to some other Y.  If a solution returned is not satisfactory for whatever reason, you could provide additional constraints.  If no solution is found, it is possible in some circumstances that the proof assistant could suggest a function that would provide a solution and ask you to code it.  Additionally, if you change a function, the proof assistant could immediately tell you if that breaks anything.

I feel that the combination of provably correct code (which saves an enormous amount of time in testing) and automatic code generation makes formal methods unstoppable.  In order for this sort of technique to work, metadata has to be ubiquitous in the language.  As I mentioned previously, metadata does not imply everything is typed, but rather that there are constraints on the behavior of everything.  As an example, take str.split(), for which we we know the following:

  • takes a collection X
  • optionally takes a separator S
  • optionally takes a maxsplit, M
  • returns a collection of collections Y
  • X != Y
  • len(Y) <= len(X)
  • if M, len(Y) <= min(len(X), M)
  • len(Y[n]) <= len(X)
  • type(Y[n1]) == type(Y[n2]) == type(Y[2]) ==  ...
  • Y[n] in X
  • if S, Y[0] + (S + y for y in Y[1:]) == X
  • if S and len(Y) == 1, S not in X (and the converse, if S not in X, len(Y) == 1)
  • if S and len(Y) > 1, S in X (and the converse, if S in X, len(Y) > 1)
  • etc...

As you can see, there is actually quite a lot of useful information, even from a simple function like split.  This is the sort of stuff a theorem prover needs in order to go to town.  Then, instead of writing a function that goes through a bunch of steps to complete a process, I just specify the start and end points, add some constraints on the solution and let the proof assistant give you a provably correct solution automatically.

There are complications in this process of course.  For one, exceptions throw a minor monkey wrench in the works.  The search process is also capable of producing very bad functions, so a modified search strategy would be required.  These are relatively minor issues though, and are both dwarfed by larger issues:

  • How do we get programmers to perform thoughtful annotation?
  • What sort of interface is required to let a programmer work collaboratively with a proof assistant?  (hint: the editors and IDEs we are all used to aren't it).

Despite the minor issues and major paradigm shift required in what it means to program, I feel that formal methods provide the best method to handle the trend towards massively increasing software complexity.

Why symbolic objects/expressions are important for Python

I recently made the case for symbolic programming in the Python ideas group. Reactions were mixed, but that is to be expected with any significant suggestion.  I think by fleshing out the idea somewhat, adding some use cases, and perhaps creating an example implementation in PyPy would go a long way towards demonstrating how amazingly useful this could be.

I came to this point after being frustrated with the with data manipulation is typically handled in Python.  List comprehensions, map and other methods of performing a function on every element of a python iterable just feel kind of clumsy when you are focused on data transformation, particularly when you start to get any sort of expression complexity.  The standard Python technique to handle this is to assign intermediate values, however this breaks down when considering a potentially infinite sequence of input.  Additionally, if there are side effects when accessing a variable that are partially dealt with in the next step of computation, by performing everything at once you minimize the time that these effects are present in the system.  This can be mitigated to some degree by returning generators everywhere, but that clutters your code.

After some comments from Nick Coghlin, I put together a little library called Elementwise to allow chaining, generative expressions that can be lazily applied to every element of an iterable (or even a graph of iterables).  This was a fun project to write, and I got to check out a lot of areas of Python I don't usually explore, such as the alternate use of FunctionType and cell objects.  I Hope some people who read this take the time to check it out :)

I appreciate the code as data and philosophy of Lisp, and how thoroughly Mathematica approaches symbolic programming.  Let me give a little example of how I imagine those features would appear behind Python tinted lenses...

Assume for a moment that we have a special SymbolicType type where any attempt at attribute lookup (including special methods that skip getattribute access like __add__ and __mul__) causes the expression represented by the calling stack frame to be folded into a coroutine object and returned immediately.  You realize the data by send()ing to the coroutine (whatever the method is called, evaluate has a nice ring too).  These SymbolicTypes would take a default value like a function argument as well, to provide support for function partials.

Subclasses of this SymbolicType can hook into attribute access just like any standard Python class.  These might not behave like standard python methods, because I feel it is more useful to hook your logic in after you have received a value.  What is more likely is overriden behavior on the metaclass of SymbolicType __new__ method would wrap supplied methods, treating them like callback functions.

With this, you can chain together operations on SymbolicType instances generatively, creating abstract expressions.  This lets you define functions in-line in a manner similar to lambda, over multiple lines, in a clean manner, without the annoying lambda statements everywhere.  You could even take the next step by writing your functions in an entirely symbolic manner, and use them as templates for non symbolic functions.  This would allow you override specific parts of functions in a very clean way.  Additionally, functions that operate on other functions would become much more powerful.  Like in mathematica, you could create closed form solutions by operating on the function rather than the data they generate.  Logic programming (which is pretty neat but can be somewhat counter-intuitive in use) would be easily usable in the context of regular code.

To be sure, this is all a big departure from Python as it exists now.  The beauty of it is that it is all available as a result of the SymbolicType's behavior on attribute acces.  With this one feature, the rest of Python will grow and evolve to incorporate symbolic features over time, in an organic way.  Nobody will be forced to change the way they code if they don't want to use SymbolicTypes.  A lot of the nice things about languages like Haskell/Lisp, Prolog and Mathematica would be available without sacrificing Python's intuitiveness and general "pleasantness of use".

Making Python 3 annotations useful

I recently lamented in the Python mailing lists that a great deal of useful information was being thrown away.  Initially, my thought was that perhaps returning some kind of parameterized generic collection instead of an ordinary collection would provide a clean way to preserve this information.  After some interesting back and forth, I realized that the conversation would be better framed in terms of general metadata.  From this perspective, Python 3 already has a tool that is fairly well suited to solve the problem: Annotations. Because it was decided to let a solution evolve organically, a chicken and egg problem was created, where tool makers don't want to support specific annotations because nobody is using them, and nobody uses the annotations because no tools take advantage of them.  Additionally, the standard library is a major offender with regard to throwing information away, and a third party library can't fix in any sort of clean way.

I believe that a fairly well tested paradigm exists to deal with the metadata issue while simultaneously bringing a lot of additional benefits to the table: Design by Contract.  There is a PEP for the addition of contracts to Python, however the reference implementation is very rough and dated.  I looked at the discussions related to PEP316, and came across the following objections:

  • They are "overengineering", by which I believe the original author meant that Python is simple and Contracts have the ability to add complexity, thus contracts are bad.
  • Unit tests are cleaner and more effective.
  • You can already implement contracts using annotations or third party libraries (which is actually addressed in the PEP).

The "overengineering" view is purely philosophical, and I disagree with in on the grounds that if taken to its logical conclusion, Python would be relegated to basic scripting and as an educational language.

I understand the "unit tests are cleaner and more effective" argument, but I feel that it is misguided for the following reasons:

  • Contracts are a clear statement of axioms, whereas unit tests are not implicitly clear.
  • If the contract implementation requires similar setup to unit tests, then unit tests are "cleaner" because they do not litter the code with testing artifacts; thus this point is predicated on a poor implementation of contracts.
  • Unit tests are more effective than contracts, given that you can test very arbitrary conditions; I do not feel that the two are mutually exclusive though, and contracts could even be used to simplify tests.
  • Contracts can be used to automatically generate tests.
  • Contracts provide metadata which can be used for purposes other than testing, such as static analysis, optimization and documentation.

The point about implementing contracts using annotations or third party libraries is valid, however I would argue that having standard library functions implement contracts would greatly enhance the scope and magnitude of static analysis and optimization opportunities.  The only issue with this (and it is significant) is that contracts are defined in such a way that subclassing has very specific effects on pre and post condition strength, which will probably not be uniformly respected.

I think abstract base classes provide the best mechanism for specifying contracts.  The main issue with this is that constantly creating abstract base classes to define contract elements is likely to become tedious, although this represents an opportunity for third party libraries.  Another potential issue with this is that if abstract classes are not used, you are basically just type checking, which might be useful in some contexts but should be discouraged in general.

 

 

 

Personal Project Outline: Next generation URL routing engine

I really like Werkzeug and Flask a lot.  My main frustration in using them is at having to write views that dispatch to other functions based on a variety of information in the request, it makes the code harder to read and debug.

My ideal routing engine would be able to route to a view function using absolutely anything in the request, as well as available layer 3/4 protocol information.  Some things I would like to route by that aren't usually routable include:

  • Accept header
    • Avoid having a ridiculous number of different paths to the same resource
  • Content-Type
    • Again, avoid multiple paths to the same resource
  • Cookies
    • Affiliate stuff
    • Divergent user interfaces
  • X-Requested-With
    • Serve chromeless pages
  • User-Agent
    • Skip the annoying workarounds, just serve short bus browsers their own "special" content.
  • Referer
    • Any sort of affiliate stuff
  • URL query args (yes I realize this is available in some routing engines)
  • IP Address
    • Mobile users on wireless connections should get as much fat trimmed as possible
    • Special site features for certain domains
    • Marginal geolocation when HTML5 geo services aren't available

Currently, if you want to deal with this stuff using Werkzeug routing, you either end up with huge, messy view functions or small view functions that redirect to a variety of other functions, which can make following the code execution path slightly more difficult.  Additionally, this is an easy place for bugs to occur, and unless you create a mini-router abstraction you are probably violating DRY.

People familiar with URL routing engines will probably be wondering two things right about now:

  1. How could this possibly be performant?
  2. How on earth am I going to resolve the route, given the complexity of some header elements?

To answer the first point, I have been thinking about an algorithm that won't just avoid serious slowdown, but actually be incredibly fast, likely much faster than current approaches

  1. Generate a Radix tree from rule URLs so you can parse incoming request URLs into tokens quickly
  2. Tokenize rule URLs into sequences by splitting at substitution element boundaries.
  3. Group and transform routing rule elements into lower dimensional "features"
    1. If a header element is specified for a resource that should produce an error under some circumstances  (e.g. method and 405, accept and 406) and there is no rule for the corresponding path, an error view function is created for it.
    2. Rule elements from the set of eligible rules are combined into features based on some information theoretic quantity, such as variation of information 
    3. Only consider up to and including the first non matching URL token of the sequence
    4. Transform rules into the new feature space
  4. Select a feature to use for a "split" in order to optimize some information theoretic quantity
    1. It is possible that this step could be combined into step 2 of the rule transformation process, for instance with a minimum spanning tree algorithm.  The viability of this needs to be examined, and is only possible if a metric is used.
  5. Generate a map from feature value to a second empty map
  6. For each empty map generated, repeat the process from step 3, using the next unused URL sequence token

The process of resolving a route involves the following process:

  1. Find a Tokenize the URL of the request using the Radix tree to the maximum depth possible given the current branch
  2. Since many headers can contain sequences of values (for instance, Accept) and it is not possible to ascertain a match for a URL match token a priori, a closure (in the form of a generator function) must be created
    1. The closure function lazily returns the product of all indeterminate elements in the feature in order of preference for header fields, and specificity for matching URL tokens.
  3. Use the transformation function from step 3 in the tree creation process and candidate values from the closure function described in step 2 to generate a candidate match
  4. Try to traverse the tree using the candidate, repeating the process from step one for each sub branch
  5. If the traversal is successful, return the view function
  6. If the traversal is not successful, try the next candidate match.
  7. When all candidate matches are exhausted, return a 404 response.

This route data structure is closely related to a Radix tree, and borrows much of of the tree construction process from decision trees.  The benefit of this structure is that it will be possible to identify many routes most routes in O(log n).  Only routes that are highly ambiguous due to poor use of matching elements will approach the O(n) time behavior seen in most routing engines.

Personal Project Outline: Merger of PyLit and Sphinx-autodoc

I've always been irritated by having to maintain sphinx documentation separate from my python code. Literate Programming is a nice way to deal with the separation of documentation and code.  There is a fairly well developed literate programming tool for python called PyLit that is capable of performing round trip conversion between reStructured Text and Python, and it has a certain amount of charm.  I haven't adopted PyLit because would prefer to stay a little closer to the standard Sphinx API documentation format; it is easy to find things and people are used to it.

What features would the ideal union of PyLit and Sphinx-autodoc have?

  • Inline text blocks with full Sphinx support 
    • Table of contents entry generation
    • Cross referencing
    • Indexing
  • Source element (function, class, method, attribute and text block) grouping
    • Provide a finer grain of control for Sphinx-autodoc member grouping
    • Clearest order of elements in source is not usually the clearest order of elements in docs
  • Generate docs directly from source, without having to maintain separate text files
  • Integrate code view more tightly into documents, perhaps with a collapsible page element immediately below the signature
    • Omit the signature and docstring
    • Perhaps put some work into the code crosslinking

Technical issues:

  • The autodoc visitor would need to be able to include text blocks
    • Designate with contiguous "#:" commented lines with one or more whitespace only lines both preceeding and following
  • The autogen script must be updated to support the required features 
  • A group directive must be created
  • Autodoc's member-order directive and code will need to be updated
    • Needs to account for text blocks
    • Needs to support explicit ordering of groups
  • The viewcode extension will need to be hacked to provide inline function code
    • The theme will need to be updated to support code element collapse