When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and that the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.
Read More »
I sat at Berkeley CS faculty lunch this past week with Brian Harvey and Dan Garcia, two guys who think hard about teaching computing to undergraduates. I was waxing philosophical about how we need to get data-centric thinking embedded deep into the initial CS courses—not just as an application of traditional programming, but as a key frame of reference for how to think about computing per se.
Dan pointed out that he and Brian and others took steps in this direction years ago at Berkeley, by introducing MapReduce and Hadoop in our initial 61A course. I have argued elsewhere that this is a Good Thing, because it gets people used to the kind of disorderly thinking needed for scaling distributed and data-centric systems.
But as a matter of both pedagogy and system design, I have begun to think that Google’s MapReduce model is not healthy for beginning students. The basic issue is that Google’s narrow MapReduce API conflates logical semantics (define a function over all items in a collection) with an expensive physical implementation (utilize a parallel barrier). As it happens, many common cluster-wide operations over a collection of items do not require a barrier even though they may require all-to-all communication. But there’s no way to tell the API whether a particular Reduce method has that property, so the runtime always does the most expensive thing imaginable in distributed coordination: global synchronization.
From an architectural point of view, a good language for parallelism should expose pipelining, and MapReduce hides it. Brian suggested I expand on this point somewhere so people could talk about it. So here we go.
Read More »
MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off the MADlib team!
Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:
- standard statistical methods like multi-variate linear and logistic regressions,
- supervised learning methods including support-vector machines, naive Bayes, and decision trees
- unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
- descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
- statistical support routines including an efficient sparse vector library and array operations, and conjugate gradiant optimization.
More methods are planned for future releases. Myself, I’m working with Daisy Wang on merging her SQL-based Conditional Random Fields and Bayesian inference implementations into the library for an upcoming release, to support sophisticated text processing.
Read More »
I don’t usually post about business deals on my blog. But today’s acquisition of Greenplum by EMC
is too close to home not to comment. I’ve been involved as a technical advisor at Greenplum for almost three years, and joined the EMC technical advisory board this spring — so I have some interest in the deal.
Below is my take on things from the technical side. Note that I’m not privy to any private information about the deal, and I’m generally more interested in the tech than the finance. No need to try and read financial tea leaves here — there aren’t any. This is a computer scientist’s view of the technology implications. Here goes:
Bright and early next Monday morning I’m giving the keynote talk at PODS, the annual database theory conference. The topic: (a) to summarize seven years of experience using logic to build distributed systems and network protocols (including P2, DSN, and recent BOOM work), and (b) to set out some ideas about the foundations of distributed and parallel programming that fell out from that experience.
I posted the paper underlying the talk, called The Declarative Imperative: Experiences and Conjectures in Distributed Logic. It’s written for database theoreticians, and in a spirit of academic fun it’s maybe a little over the top. But I’m hopeful that the main ideas can clarify how we think about the practice of building distributed systems, and the languages we design for that purpose. The talk will be streamed live and archived (along with keynotes from the SIGMOD and SOCC conferences later in the week.)
Below the break is a preview of the big ideas. I’ll post about them at more length over the next few weeks, hopefully in more practical/approachable terms than I’m using for PODS.
Read More »
We were happy to find out this week that our BOOM project and and Bloom langauge have been selected by Technology Review magazine as one of the TR10, their “annual list of the emerging technologies that will have the biggest impact on our world.” This was news to us — we knew they were going to run an article, but weren’t aware of the TR10 distinction. Pretty neat.
I’ve been getting a lot of questions since the article launched about the project and language. So while folks are paying attention, here’s a quick FAQ to answer what the project is all about and its status.
Read More »
It’s official: the name of the programming language for the BOOM project is: Lincoln Bloom.
I didn’t intend to post about Bloom until it was cooked, but two things happened this week that changed my plans. The first was the completion of a tech report on Dedalus, our new logic language that forms the foundation of Bloom. The second was more of a surprise: Technology Review decided to run an article on our work, and Bloom was the natural way to talk about it.
More soon on our initial Dedalus results.
Hadoop MapReduce is a batch-processing system. Why? Because that’s the way Google described their MapReduce implementation.
But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype [updated link to final NSDI '10 version]. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system to an interactive, online system that can provide features like “early returns” from big jobs, and continuous data stream processing, while preserving the simple MapReduce programming and fault tolerance models popularized by Google and Hadoop. And by the way, it exposes pipeline parallelism that can even make batch jobs finish faster. This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
Read More »
For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming. The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud. We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.
To kick this off we built something we call BOOM Analytics [link updated to Eurosys10 final version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols. BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure. As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase. Two of the fanciest are: Read More »
Was intrigued last week by the confluence of two posts:
- Owen O’Malley and Arun Murthy of Yahoo’s Hadoop team posted about sorting a petabyte using Hadoop on 3,800 nodes.
- Curt Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 nodes
Both impressive. But wildly different hardware deployments. Why?? It’s well known that Hadoop is tuned for availability not efficiency. But does it really need 40x the number of machines as eBay’s Greenplum cluster? How did smart folks end up with such wildly divergent numbers?
Read More »