Hamilton -- A General-Purpose Microframework for Scalable Feature Engineering
At Stitch Fix, data is integral to every facet of our business. We run a plethora of dataflows to transform raw data into features that models use to serve customers. We need to scale these dataflows both in code complexity, as we add new capabilities, and in data size, as we gain more customers. To keep these dataflows from devolving into an unmaintainable mess of spaghetti code (chock-full of in-place pandas operations), we built and open-sourced Hamilton, a pluggable microframework that makes scaling and managing complex dataflows easy. To use Hamilton, one creates a dataflow by writing simple Python functions in a declarative manner. The framework stitches them together and introduces an abstraction for configuring and executing these dataflows. In this talk, we'll present the basic concepts of Hamilton, discuss its impact at Stitch Fix, share recent extensions to the project, including integrations with Dask, Spark, and Ray, and explain why we're excited for its future.
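To give a flavor of the declarative style, here is a minimal sketch in plain Python. It is not Hamilton's implementation or API. It only illustrates the idea that a function's name declares an output and its parameter names declare what it depends on. The feature names, the `resolve` helper, and the input values are all hypothetical, chosen for illustration.

```python
import inspect


# Hypothetical feature definitions. Each function name is an output;
# each parameter name refers to a raw input or another function's output.
def avg_spend_per_order(total_spend: float, order_count: int) -> float:
    return total_spend / order_count


def is_high_value(avg_spend_per_order: float, threshold: float) -> bool:
    return avg_spend_per_order > threshold


def resolve(name, functions, inputs, cache=None):
    """Compute `name` by recursively resolving the parameters it depends on.

    Illustrative helper only (an assumption of this sketch, not a Hamilton call).
    """
    cache = {} if cache is None else cache
    if name in inputs:  # raw input supplied by the caller
        return inputs[name]
    if name not in cache:  # compute each node at most once
        fn = functions[name]
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


functions = {f.__name__: f for f in (avg_spend_per_order, is_high_value)}
inputs = {"total_spend": 300.0, "order_count": 4, "threshold": 50.0}
result = resolve("is_high_value", functions, inputs)
```

Because dependencies are read from the function signatures, adding a new feature means adding one more function; no orchestration code has to change.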