Welcome


We have worked with Hadoop for about a year now, and we created this site to record some of our thoughts and ideas and to share the code we wrote with the community.

Hadoop is written in Java, and the most natural way to use it is to code your own mappers and reducers in Java too. The only catch in this "natural approach" is that you need to port your existing applications to Java if you want them to take advantage of the framework. For those who cannot do that easily, there is a way around it: the streaming interface (the link points to the Hadoop 0.18.0 streaming package; once you get serious, pick the documentation for your Hadoop version, because the interface changes a little between releases). The nice thing about streaming is that you can plug in your own mappers and reducers, written in any language you like. All they need to do is read from STDIN and write to STDOUT.
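
To make the STDIN/STDOUT contract concrete, here is a minimal sketch of a streaming mapper and reducer in Python. The word-count task, the function names and the "map"/"reduce" command-line argument are illustrative assumptions, not part of Hadoop. The sketch also assumes the default streaming convention of tab-separated key/value lines, and that the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    # Hypothetical convention: the same script acts as mapper or reducer,
    # depending on its single command-line argument.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, wc.py (a hypothetical file name), the pair can be tried without a cluster by letting the shell play the role of the framework: cat input.txt | python wc.py map | sort | python wc.py reduce.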

A few limitations became very apparent as we started to build complicated data processing pipelines using Hadoop and to advise our clients on the technology. Here, in no particular order, are our findings and our current thinking.

  • There is a problem with streaming itself: you do not have access to the full Java API. One annoyance is that you cannot easily supply a combiner class, although you can work around this by doing the "combining" as part of the map: you just need to sort the map output locally.
  • Another issue we noticed is a practical problem with the map-reduce framework itself, namely that it is a relatively low-level API. To use it you invariably build a structure on top of it, which gets complicated rather quickly. As a consequence, a lot of time is spent on debugging, and the code becomes difficult to maintain. It only gets worse as the complexity of the data flow grows. The community has recognized these problems, and a few interesting (and rather different) approaches are emerging to tackle them.
    • The best known is probably the Pig programming language, which "compiles" into a set of map-reduce jobs. It allows you to do rather complicated things in a seamless way.
    • If you want to treat your HDFS storage as some sort of database and keep an SQL-like interface to the map-reduce framework, then check out Hive, a recently open-sourced project by Facebook developers. The tutorial is very exciting.
    • Finally, we found the Cascading Java API very interesting. Again, it is built on top of map-reduce and Hadoop but, as opposed to Hive, it emphasizes the "work flow" view over the database view. If you have data transformations that need to run reliably and repeatedly, then definitely check it out.
  • HDFS can be exposed through FTP. We have developed an FTP server on top of the Apache FTP framework: HDFS over FTP.
  • Another option is to expose HDFS through WebDAV. In our experience it is less stable than FTP, but you can still try it: HDFS over WebDav.
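
The combiner workaround mentioned in the first point above (combining inside the map after a local sort) can be sketched as follows. This is a minimal illustration under stated assumptions: the task is again a word count, the function name and the "map" argument are hypothetical, and the whole map output of one input split is assumed to fit in memory. A real mapper would more likely flush partial results in chunks.

```python
import sys
from itertools import groupby


def map_with_local_combine(lines):
    """Mapper that pre-aggregates its own output before emitting it."""
    # Plain map step: one (word, 1) pair per word, buffered in memory.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # The local sort puts equal keys next to each other ...
    pairs.sort(key=lambda kv: kv[0])
    # ... so the "combine" is a single pass over adjacent groups.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(n for _, n in group)}"


if __name__ == "__main__" and sys.argv[1:] == ["map"]:
    # Hypothetical convention: run as the streaming mapper when called
    # with the single argument "map".
    for record in map_with_local_combine(sys.stdin):
        print(record)
```

The reducer does not need to change: it still sums the counts per key, it just receives fewer and larger partial counts from each mapper.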


Please send any suggestions or comments to hadoop AT iponweb DOT net




We are a consulting company – learn more at www.iponweb.net