Data Crunching: Solve Everyday Problems Using Java, Python, and more. | 
Author: Greg Wilson Publisher: Pragmatic Bookshelf Category: Book
List Price: $29.95 Buy New: $13.97 You Save: $15.98 (53%)
New (25) Used (9) from $12.25
Avg. Customer Rating: 12 reviews Sales Rank: 733491
Media: Paperback Number Of Items: 1 Pages: 176 Shipping Weight (lbs): 0.8 Dimensions (in): 8.7 x 7.5 x 0.6
ISBN: 0974514071 Dewey Decimal Number: 005.13 EAN: 9780974514079 ASIN: 0974514071
Publication Date: April 20, 2005 Availability: Usually ships in 1-2 business days Condition: All orders ship same business day via standard shipping (USPS Media Mail) if received by 1 PM CST.
|
| Editorial Reviews:
Product Description Every day, all around the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check that configuration files are internally consistent, and search through web logs to see how many people have downloaded the latest release of their product. This kind of "data crunching" may not be glamorous, but knowing how to do it efficiently is essential to being a good programmer. This book describes the most useful data crunching techniques, explains when you should use them, and shows how they will make your life easier. Along the way, it will introduce you to some handy, but under-used, features of Java, Python, and other languages. It will also show you how to test data crunching programs, and how data crunching fits into the larger software development picture.
|
| Customer Reviews:
Short, Informative, Useful and Clear August 15, 2006 2 out of 2 found this review helpful
Some of the best technical books are short, clear, easy to understand, and practical. Greg's book fits this description. This is a great book for exploring algorithms in the Python language. The book assumes the reader has at least a basic understanding of the Python programming language or some programming experience. I was delighted that topics were presented in a concise and unambiguous way and that the book was short. There should be more short books published!
good data-handling cookbook for a beginner July 18, 2006
This book is mainly concerned with scripting as a 'glue' between applications: processing various input and output formats. The book is divided into 5 main categories of data handling: plain text, regular expressions, XML, binary data and SQL. There is a final chapter on various miscellaneous topics. Most of the examples are given in Python. Some of the code is demonstrated in Java, although, disappointingly for a book published in 2005, none of the Java 5.0 features are leveraged. However, if nothing else, it demonstrates why Java is not anyone's first choice for such activities.
If you've read any of the O'Reilly cookbook series, you will know what to expect, although the chapters are more cohesive and less episodic. Beginning programmers will get the most out of this book, although intermediate programmers should find at least some material here that's new to them.
The XML chapter is a pretty good introduction to the use and advantages/disadvantages of SAX and DOM, and XSLT is also described, although the discussion is not so clear. Those without experience with databases will welcome the chapter on SQL. The discussion on dealing with plain text files in chapter 1 was a highlight for me, a subject not often covered in much depth in cookbooks; if, like me, you still regularly need to convert between various plain text formats, this chapter will help formalise approaches that you may already be carrying out in a less than rigorous fashion.
Additionally, the paragraphs on floating point arithmetic were intriguing but all too brief. The chapter on dealing with binary is fairly good, although rather dry. Peter Seibel's discussion of binary data in the context of writing a Shoutcast server in Practical Common Lisp shows that the subject can be dealt with in a more compelling fashion. That said, for the most part, author Greg Wilson is a genial companion; the writing style is chatty, but doesn't overdo it.
Overall, if you own any cookbook-style books, there is little here that you don't already know. Even for a beginner, it's hard to see how anyone who decides they need this book hasn't already been exposed to some of the material here. In particular, does anyone really need yet another introduction to regular expressions? The treatment here isn't bad, it's just that this material is already covered in many introductory programming books (especially those that cover scripting languages like Perl and Python). As this takes up nearly 20% of the book, and there's less than 200 pages, it's a bit of a waste. Personally, I would have preferred more discussion of the less well-treated subjects, some of which are too sparsely described, but this would have detracted from the book's main aim.
This would be suitable for a beginner Pythonista, who for some reason didn't want the bulk of the likes of Python Cookbook. Otherwise, if you feel that some Pragmatic Programmers books can be rather lightweight and somewhat overpriced, this will not change your mind.
An overview of parsing and mining data with python. July 3, 2006 1 out of 1 found this review helpful
The book presents the topics in conjunction with showing some practical data mining examples that any person might encounter. This book is recommended to people who are interested in basic parsing of data (text, XML, binary, etc) using python.
I got the impression that the author was trying to cover too much in too little space. The title, for example, mentions Java, Python, and more. This is deceiving since the book uses python for about 99% of its examples. And while the book does present Java, it only does so to show that it would be easier to use python. Almost no other languages are covered, although there are some examples in Ruby and Bash.
A clever guide to extracting the data you need July 1, 2006 3 out of 4 found this review helpful
Data Crunching by Greg Wilson.
The book opens with a statement of purpose: transmuting data from one form into another. The focus is on problems where the hardest part is extracting the data, not problems where the hard part is processing it. Think simple transformations and data grazing rather than data mining: the ideal problem for these techniques is small, separable, and useful in a variety of contexts.
Major book sections include: Text, Regular Expressions, XML, Binary Data, Relational Databases, and a twenty page miscellaneous section. As usual for Pragmatic Programmer's books, the text is short, coming out at 187 pages, with source code on the web site. That regular expressions come up pretty much immediately tells us that the text will be unix-heavy. Not a bad thing, really, as a few simple unix tools can often save hours of anguish.
Introduction:
The introduction is clear on the intended audience. Read the nine pages, and you have a darn good idea whether the book is worth reading for your tasks. Among other things, the tools he strongly suggests installing - Python, Java, a command line xslt processor, a relational database, and unix command line tools - make the level of this effort clear. No GUIs, no complicated database reverse engineering tools. Note also, no Perl. (For me, a bonus, but a deal killer for some.)
Text:
Early in the text chapter, the author spends some time examining a data file, then writing out the result. He makes a point of looking up a spec, then ignoring most of it. The YAGNI principle (You Aren't Gonna Need It, or You Will Not Need It, for grammarians) says that detailed interpretation of the spec is not as important as carefully making sure the program reads the files you actually have to process. After all, your files may be misformatted, or may only use a small fraction of the specification. His example showed that three iterations through some samples got him just about everything he needed, in a very short time. He looked at the input file spec to see if there were corner cases he would need to solve, then focussed on the actual conversion he wanted to make. This theme recurs often - simple data cruncher programs need to be correct, but they do not have the same needs as a general data parsing tool. Do not try to solve every conceivable problem, try to solve the one you actually have.
Interestingly, the author lumps Python, Ruby, and Java on one side, and Perl/C++ on another as far as 'thought collisions' go. Examples thus far include both Python and Java, with more Python than anything else. The Java examples are 1.2+ - using Java 1.5's autoboxing would have simplified several examples to roughly the Python complexity.
The author is not sloppy, but he does take judicious shortcuts. Trimming a file extension in python, he uses a hardcoded three character extension, rather than the more correct os.path.splitext(). He then mentions that the more correct function required an extra library import, which would have cluttered the text. (The text does mention the more correct function in a sidebar.) A page later, he points out the perils of repeating yourself, and the kinds of errors it can produce. In other words, a bad file extension is likely to cause a visible failure early, while repeated code can lead to subtle bugs. This kind of tradeoff comes up a lot when coding; I was glad to see him make a point of it.
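To make the shortcut concrete, here is a minimal Python sketch of the two approaches. It is not the book's code; the function names and sample filenames are made up for illustration, and the only library call used is the standard os.path.splitext.

```python
import os.path


def strip_ext_hardcoded(filename):
    # Shortcut: assumes the name ends in a dot plus exactly three characters.
    return filename[:-4]


def strip_ext_splitext(filename):
    # os.path.splitext splits at the last dot, whatever the extension length.
    return os.path.splitext(filename)[0]


print(strip_ext_hardcoded("report.txt"))  # report
print(strip_ext_hardcoded("page.html"))   # page.  (visibly wrong)
print(strip_ext_splitext("page.html"))    # page
```

The hardcoded version goes wrong in an obvious way as soon as an extension is not three characters long, which is the "visible failure early" the tradeoff relies on.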
Regular expressions:
Regular expressions are, to my mind, one of the more convoluted topics that programmers encounter on a regular basis. Getting a deep understanding is worthy of a book in itself, and using Perl or Ruby properly _requires_ that understanding. Java now has regex support, though python's is easier to use. He references the standard book on the topic.
The author tries to cover the 10% of the topic that you will regularly use. This will not turn you into an RE-master ready to tackle any Perl you happen to see, but it will be enough for the tasks he is describing. He spent quite some time describing character encoding, Unicode, ISO Latin-1, and the like, which was a welcome surprise.
I note further that not a single Perl example showed up in the RE chapter. Many Python excerpts, some Java excerpts, a lonely-looking Ruby script, but no Perl. Made me happy, but this might infuriate a Perl aficionado.
XML:
The XML chapter goes into a great deal of detail on SAX. The author compares SAX responders to GUI event responders, which felt a bit strained. Most GUI responders do not have quite the order dependency that SAX responders do. That said, the introduction was clear, and the limitations were explicit. While he did discuss the need to keep state, the discussion came later than I liked.
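For readers who have not used SAX, the need to keep state can be sketched with Python's standard xml.sax module. This is an illustration under my own assumptions, not an example from the book: the TitleCollector class, the collect_titles helper, and the sample document are all hypothetical.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # state: are we currently inside <title>?
        self.parts = []        # text chunks seen inside the current <title>
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.parts = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self.in_title:
            self.parts.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.parts))
            self.in_title = False


def collect_titles(xml_text):
    handler = TitleCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


SAMPLE = (
    "<books>"
    "<book><title>Data Crunching</title></book>"
    "<book><title>Another Book</title></book>"
    "</books>"
)
print(collect_titles(SAMPLE))
```

The handler only makes sense if its callbacks fire in document order, which is the order dependency that sets SAX apart from most GUI event handling.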
The DOM section wisely uses language-specific APIs, like Java's JDOM and Python's minidom. I find tree-based APIs more useful than stream APIs in general, and this was a good, if brief, introduction to my most commonly used XML API.
The XPath section is also brief, but clear. Much like the DOM section, the author mentions language-specific APIs. Since many DOM-like APIs have an XPath module, even programmers not planning on doing much with XML will still find it useful. The author draws a parallel to regular expressions - a rich, dense language that can select small pieces out of a large mass of data.
Since I find XSLT a Martian space language, despite having used it heavily for several projects, I was pleased that the author's impression of it matched my own: "I don't really like XSLT that much". The introduction is clearly written, but unlike the other technologies, an introduction does not provide enough meat to solve a real problem.
Binary Data:
My rules of binary data: do not use it if you can use text, and if you must, try to find a library. This chapter mentions that on the second page.
The 19 pages of this section cover binary integer representations, string representations, bit shifting, designing self-contained comprehensible binary data formats, and packing as much data per byte as you can. The author did not mention recognizing and parsing gzipped textual data, but other than that, this chapter had a good collection of useful ideas. I have found text, xml, and databases to be more important in my work, but binary files do come up. This chapter reminded me why I always get a sinking feeling when they do.
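As a rough idea of what these topics look like in practice, here is a small Python sketch using the standard struct module plus plain bit shifting. The record layout (a 16-bit version followed by a 32-bit count, little-endian) and the helper names are invented for illustration and are not taken from the book.

```python
import struct

# Hypothetical record layout: unsigned 16-bit version, unsigned 32-bit count,
# both little-endian ("<" disables padding and fixes the byte order).
HEADER_FORMAT = "<HI"


def pack_header(version, count):
    return struct.pack(HEADER_FORMAT, version, count)


def unpack_header(data):
    return struct.unpack(HEADER_FORMAT, data)


def pack_nibbles(high, low):
    # Squeeze two 4-bit values (0-15) into a single byte.
    return (high << 4) | low


def unpack_nibbles(byte):
    return byte >> 4, byte & 0x0F


header = pack_header(2, 1000)
print(len(header))            # 6
print(unpack_header(header))  # (2, 1000)
print(unpack_nibbles(pack_nibbles(10, 5)))  # (10, 5)
```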
Relational databases:
This 30 page section gives a good introduction to the practice of SQL databases, using sqlite as the engine. Simple joins and normalizing tables showed up by the fifth page of the chapter. Between aliases, nesting, and negation, the author claims that you should be able to do perhaps 90% of the queries you will need. I rather agree - with the caveat that you do need to understand left and right, inner and outer joins, which are not covered. (They are footnoted to a reference, so expect to snag a good SQL book.)
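Since the outer joins are the part left to a reference, here is a minimal sketch of the difference between an inner join and a left outer join, using Python's standard sqlite3 module with an in-memory database. The person and download tables and their contents are made up for this illustration and do not come from the book.

```python
import sqlite3


def run_join_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE download (person_id INTEGER, filename TEXT);
        INSERT INTO person VALUES (1, 'Alice');
        INSERT INTO person VALUES (2, 'Bob');
        INSERT INTO download VALUES (1, 'release-1.0.zip');
    """)
    # Inner join: only people who have at least one matching download row.
    inner = conn.execute(
        "SELECT p.name, d.filename FROM person p "
        "JOIN download d ON d.person_id = p.id ORDER BY p.id"
    ).fetchall()
    # Left outer join: every person, with NULL (None) where nothing matches.
    left = conn.execute(
        "SELECT p.name, d.filename FROM person p "
        "LEFT OUTER JOIN download d ON d.person_id = p.id ORDER BY p.id"
    ).fetchall()
    conn.close()
    return inner, left


inner_rows, left_rows = run_join_demo()
print(inner_rows)  # [('Alice', 'release-1.0.zip')]
print(left_rows)   # [('Alice', 'release-1.0.zip'), ('Bob', None)]
```

Bob disappears from the inner join because he has no downloads, while the left outer join keeps him with a NULL filename.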
After joins, the chapter covers aggregation functions, views, and nulls. I give the author credit for bringing up the "Does NULL mean not present, or does it mean unknown" war. Most books take one side or another as gospel, or do not bring it up at all.
A section on creating tables, inserting/updating data, deleting data and tables, and transactions follows. The examples are typical, and appropriate for simple CRUD apps, like many web apps. The author points out that data crunching is far more likely to involve selection than complicated create/update/delete logic.
Finally, the author covers using SQL from python and from Java/JDBC. He also described the impedance mismatch between object oriented programming and the relational model. Wisely, he suggests well tested packages to handle that. I note that the vast majority of my sql has either been in scripts to set up test data or when I was writing an Object-Relational Mapping tool. The vast majority of my code that accesses databases uses well tested ORM packages like Hibernate. (I might have brought Hibernate up earlier in the chapter. Then again, for this book's target audience, perhaps not.)
Odds and Ends:
The final chapter has 19 pages on a variety of tools. The unit testing section discusses JUnit, Make, diff, and TDD. The encoding section discusses HTML escapes, base 64, and others. The section on floating point arithmetic answers the basic questions seen daily on Java discussion lists. Date parsing is discussed in sufficient detail, though I might have added an extra sidebar on just how bad Java's date handling can be. I have little to say about this section, save that it is worth the read.
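The classic floating point question of that kind can be shown in a few lines of Python. This is my own illustration rather than an excerpt from the book, and math.isclose is a later addition to the standard library than the book itself.

```python
import math

total = 0.1 + 0.2

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the sum is very close to 0.3 but not equal to it.
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead
```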
Final Thoughts:
All in all, this book was well written, well proofed, and well designed. Like all Pragmatic books, it is available as both a downloadable PDF and a bound book. Errata and updates live on the Pragmatic web site. This is an extremely keen system - dead tree form for reading on a plane, PDF form for early access and up to date information.
It's about using the right tool for the right job June 13, 2006 3 out of 3 found this review helpful
Gregory Wilson likes Python and bash but doesn't particularly care for XSLT (or Perl, and possibly Java as well), doesn't express a preference in the great Emacs vs. Vi(m) holy war, and divides programming languages into two camps - agile, like Python and Ruby, and "sturdy", like Java. He's an adjunct CS professor at the University of Toronto, a contributing editor with Dr. Dobb's Journal, and is developing "Software Carpentry", which is either a basic course on software development aimed at scientists and engineers for the Python Software Foundation or a project to develop a newer, easier-to-use set of software development tools.
In the book, "Data Crunching: Solve Everyday Problems Using Java, Python, and More", data crunching is explored through a series of examples. The closest that Wilson comes to giving a definition is when, at the start of the first chapter, he refers to data crunching/munging as the "other 10%" of a programming task that takes up the "other 90% of the time". The first example that he gives is his experience helping a high school science teacher convert PDB (Protein Data Bank) files containing the coordinates of atoms in various molecules into a format that a Fortran sphere-drawing program could process.
From the introduction, he moves on to the manipulation of text and text files using Unix command-line tools and Python, with Java work-alikes following most of the Python scripts. Although the book's subtitle, "Solve Everyday Problems Using Java, Python, and More", gives Java first billing (possibly for marketing reasons?), Wilson's preference for Python over Java is never in doubt. After presenting the Java equivalent of a Python script that counts the number of times every email address appears in a list of email addresses, he writes:
All right. It's two-and-a-half times longer than the equivalent Python program, it isn't as fast on small files, and we have to compile it before we can run it, but other than that, it's almost as easy...
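For a sense of how short the Python side of that comparison can be, here is a minimal sketch of an address counter. It is not the book's script: it uses collections.Counter, which was added to Python after the book was published, and the function name and sample addresses are hypothetical.

```python
from collections import Counter


def count_addresses(lines):
    # One address per line; surrounding whitespace and blank lines are ignored.
    return Counter(line.strip() for line in lines if line.strip())


addresses = [
    "turing@example.org",
    "hopper@example.org",
    "turing@example.org",
    "",
]
for address, count in sorted(count_addresses(addresses).items()):
    print(address, count)
```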
With a table of useful commands, explanation of redirection and piping, and some guidelines on how to make sure that your command-line tools follow convention, the text chapter could actually be viewed as a pretty passable introduction to the philosophy of Unix.
The chapter on Regular Expressions is great. So good, in fact, that I wish I could go back in time and give myself a photocopy of those thirty-odd pages at the point that I was struggling to get a handle on RE's some years back. Also included in this chapter is a brief, but very lucid, discussion of character encoding and a bit on using grep.
Although the Text and RE chapters were my favorite, Wilson's clear and concise writing style makes the entire book, including the coverage of XML, binary data processing, and relational databases, a joy to read. With segues like "But wait a second. Wait just one pattern-matching second.", lists of email addresses to munge that include entries for Alan Turing, John von Neumann, and Grace Hopper, and the like, he also manages to inject some pleasant, if a bit groan-worthy, humor here and there into what could otherwise be a rather dry book.
He uses the last chapter, titled "Horseshoe Nails", to quickly address a number of topics, like encoding, the pitfalls of floating point arithmetic, and unit testing, which (not a surprise in a title coming from the Pragmatic Bookshelf) he likes, going so far as to say that the spread of test-driven development has been the "real revolution in programming in the last decade". Diff is introduced and he brings the venerable make to the table as a tool for automating test running.
He doesn't say it in so many words, though his retooling the old saying that "two years of hard work can save you an hour in the library" as "an hour of hard work can often save you sixty seconds on Google" comes close, but the message is to work smarter rather than harder. Use industrial-strength tools and processes when industrial-strength solutions are called for and agile, simplest-things-that-work solutions whenever possible.