Tuesday, September 18, 2012

Another tooling discussion for data mining, F#, OCaml, and Python.

As an F# programmer, I always keep an eye on its cousin, OCaml. OCaml and F# share a large portion of their syntax, and an average F# programmer can pick up basic OCaml very quickly, say in one afternoon. Recently, the development of OCaml seems to have accelerated: exciting language features have been added in the new releases, such as first-class modules (in 3.12) and generalized algebraic data types (GADTs) (in 4.0). For a good feature comparison of the two languages, please refer to this SO question.

Today, I’d like to start a short discussion on their application in a specific area, data mining.

The first question the reader may ask is “why do you want to use OCaml for data mining?” Indeed, if F# made everything convenient, there would be less point in my considering other languages. The main problem with F# is that its cross-platform support is poor. The language itself may be fine, as there is the Mono tool chain for F#, but the whole ecosystem is not cross-platform friendly. There are always one or two .NET libraries that are written specifically for Windows; to name two that I use: R.Net and Sho. Besides the libraries, the Mono runtime performs poorly with the .NET parallel library and sometimes crashes. When I was at Microsoft Research Asia, we did not have any cross-platform issues: every server was powerful, with more than 12 cores and 32GB of memory, and came with Visual Studio and other tools installed. Back at HKUST, most of the servers run Linux for no good reason (Windows Server licenses are very cheap for our school; the IT guys are simply not experienced enough to keep Windows servers secure).

Even with such constraints, we do have success stories with F#. Recently I led a team that won the semantic place prediction task in the Nokia Mobile Data Challenge. Details here. The feature extraction program is written in F# and C#, and we can generate all features (from simple counting to FFT) from 50GB of text data within two hours, thanks to F#’s asynchronous programming and other functional features!

But after that, I think we really need a better Windows server, or we have to use a language that is friendlier to Linux (and that also runs OK on Windows, since I use Windows for development).

Currently I use Python. I write scripts on my laptop, and when one works correctly on a small dataset, I put the program on the server to run it on the full dataset; the processing usually takes from one hour to one night. Python is easy to write and, as it claims, indeed easy to read. But at the language level, code reuse in a dynamic language is generally harder than in a static language. The interface is too flexible, and I usually end up struggling over which one to choose.

On the other hand, a static language encourages refactoring and provides static types to facilitate code reuse. Consider the following F# code snippet:

 

    // a common pattern for accessing the Stack Overflow CSV file, question by question
    let csv_iter (csvfile: string) (mapper: CsvReader -> string option) (mapfile: string) =
        use csv = new CsvReader(new System.IO.StreamReader(csvfile), true)
        let headers = csv.GetFieldHeaders()
        let fieldCount = csv.FieldCount

        use sw = new StreamWriter(mapfile)
        use bad = new StreamWriter(mapfile + ".badids")
        while csv.ReadNextRecord() do
            match mapper csv with
            | Some text -> sw.WriteLine(text)
            | None ->
                bad.WriteLine(csv.[0])
                sw.WriteLine()

        sw.Close()
        bad.Close()



From the function signature alone, one can guess what this function does. Basically, it processes each record in a huge CSV file with a mapper function. The result of the mapper is either None, which means an error occurred inside the mapper, or Some string, which is a normal result. All the results are written to a file, line by line. Let’s focus on the mapper function: its signature has the type string option, which is a more specific requirement than merely returning a string. We have encoded the logic into the types. This is actually the main point of functional programming: types say more than they do in other languages. If we wrote this function in Python, the mapper would still return a string, but the fact that the result can be None would be encoded in the implementation, not in the type signature. In that case, we have to resort to a natural-language comment on the Python function saying that the mapper should return None on error. And such a comment is obviously less strict than a type.
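
To make the comparison concrete, here is a rough Python sketch of the same iterator. It is only an illustration: the function takes open streams instead of file names, and the names `source`, `out`, `bad` and `title_upper` are made up for this example. Note that the "str or None" contract lives only in the docstring.

```python
import csv
import io


def csv_iter(source, mapper, out, bad):
    """Run `mapper` over every record of a CSV stream that has a header row.

    Contract, stated only in this comment: `mapper` takes one row (a list
    of strings) and returns a str on success or None on failure.
    """
    reader = csv.reader(source)
    next(reader, None)  # skip the header row
    for row in reader:
        text = mapper(row)
        if text is not None:
            out.write(text + "\n")
        else:
            bad.write(row[0] + "\n")  # record the id of the failed row
            out.write("\n")           # keep the output aligned with the input


def title_upper(row):
    # Hypothetical mapper: treats an empty title as a failure.
    return row[1].upper() if row[1] else None


source = io.StringIO("id,title\n1,hello\n2,\n3,world\n")
out, bad = io.StringIO(), io.StringIO()
csv_iter(source, title_upper, out, bad)
```

Nothing stops a mapper from returning an empty string, a number, or raising an exception instead; the interpreter accepts all of them until the line that misuses the value actually runs.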



The above code is only the first version of this CSV iterator; several enhancements can be added, e.g. we can define a more detailed type for the mapper result rather than using string option, which is too simple:



    type MapResult =
        | MapError of string * string
        | NormalResult of string



Now the error has two parts: the error message and the best result string the mapper can produce, which is set to the empty string in the above csv_iter. In the Stack Overflow example, if I cannot parse the question body, maybe I can use its title as a fallback.



Say this is my old mapper, which gets the pure text from a question body, excluding its code and other annotations:



    let mapper_getbody_textonly (csv: CsvReader) =
        try
            let feat = createZero()
            let markdoc = csv.[7] |> Markdown.Parse
            calcTopParagraphs feat markdoc
            Some (
                let text = feat.PlainString.ToString()
                text.Replace('\n', ' ').Replace('\r', ' ')
            )
        with _ ->
            None

Now I have to change the code a little bit to make it compile:

    let mapper_getbody_textonly (csv: CsvReader) =
        try
            let feat = createZero()
            let markdoc = csv.[7] |> Markdown.Parse
            calcTopParagraphs feat markdoc
            NormalResult (
                let text = feat.PlainString.ToString()
                text.Replace('\n', ' ').Replace('\r', ' ')
            )
        with _ ->
            MapError ("markdown parser error", csv.[0])



In Python, nothing will tell you ahead of time that you need to change the mapper! So you have to keep in mind a list of old mappers and change them one by one to suit the new interface. In F#, the change is required, and you don’t need to keep the mapper list: the compiler will tell you to change an old mapper as soon as you use it with the new interface.
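
A small Python sketch of the same refactoring shows the difference. Everything here is hypothetical (the `NormalResult` and `MapError` classes, `collect`, and both mappers are invented for this illustration). The old mapper is accepted without complaint, and the mismatch only shows up when the code actually runs, and only because the iterator checks for it explicitly.

```python
from dataclasses import dataclass


@dataclass
class NormalResult:
    text: str


@dataclass
class MapError:
    message: str
    best_effort: str


def collect(rows, mapper):
    """New interface: `mapper` must return NormalResult or MapError."""
    lines, failures = [], []
    for row in rows:
        result = mapper(row)
        if isinstance(result, NormalResult):
            lines.append(result.text)
        elif isinstance(result, MapError):
            failures.append(result.message)
            lines.append(result.best_effort)
        else:
            # Without this explicit check the mismatch would pass silently.
            raise TypeError(f"mapper returned {type(result).__name__}")
    return lines, failures


def old_mapper(row):
    # Written for the old contract: str on success, None on failure.
    return row[1] if row[1] else None


def new_mapper(row):
    if row[1]:
        return NormalResult(row[1])
    return MapError("empty body", "")


rows = [["1", "hello"], ["2", ""]]
lines, failures = collect(rows, new_mapper)

# The old mapper is still accepted; the problem appears only at run time.
try:
    collect(rows, old_mapper)
    outcome = "no error"
except TypeError as exc:
    outcome = str(exc)
```

On a job that takes hours, such a failure may appear long after the program has started, whereas the F# version refuses to compile.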



With strict types, we are forced and helped by the compiler to change and refactor our code gradually. This is the main point I want to make about code reuse.



I keep 80% of the code in F# as long as my Windows laptop can run it. When it cannot, say when the processing takes more than 6 hours, I write the remaining 20% of the code in Python and put it on the Linux server. This is my current workaround.



So this is my little complaint about Python, or about dynamic languages in general. Let’s get back to OCaml at last. OCaml is a great functional language. Among the practical functional languages, I think Haskell is the only one that is more interesting than OCaml in the functional aspect. However, Haskell is definitely not ready for large-scale data mining tasks: its memory usage is unacceptable, not to mention the performance issues. So I seriously considered using OCaml for my current task and writing all the code in OCaml.



But library availability in the OCaml ecosystem bites me! I could not find a CSV library that supports both streaming access and Unicode. I did find a simple Markdown parser in OCaml, but it parses a lot of SO Markdown wrongly. Data mining practitioners usually depend on a lot of tools for the tasks at hand, from solid CSV and database libraries, to linear algebra libraries, to NLP taggers, etc. In this respect, .NET and Python have better libraries than OCaml does.



So this ends my story of using OCaml this time.



I will probably come back to OCaml in the future, and I think I will try it several more times. It is really a good language, and it is cleaner than F# in the functional aspect, e.g. .NET objects can still be null in F#, but there is really no null in OCaml. If everything in a project is better written from scratch, then OCaml can be a very good candidate.