Last month, open-source data integration tool Talend Open Studio reached its version 5.0 milestone, packing a set of new and updated components for accessing and manipulating data stored in a broad range of formats, applications and repositories. What’s more, Talend has expanded its Open Studio brand to include data quality, master data management and enterprise service bus components, each of which had previously shipped as separate open-source tools.
For this review, I focused on the data integration component of the product, which I last reviewed about two years ago, in its version 3.1 incarnation. Since that time, Talend has bolstered the tool with nearly 200 new data components, most recently including elements for accessing .Net data structures, for working with Hadoop interfaces and for mapping XML data sources.
Also in version 5.0, the tool, which had previously enabled users to create data integration projects in Perl or Java, does away with its Perl capabilities. In my experience with the tool, I stuck to Java-based projects, as TOS shipped with a broader range of Java-based data components.
Talend Open Studio is built on the popular Eclipse platform, which should make the tool familiar to anyone who’s used Eclipse or another Eclipse-based development tool. In addition, the Eclipse foundation provides TOS with excellent cross-platform support: the download for the product contains versions for Windows, Linux, OS X and Solaris.
Talend Open Studio for Data Integration is licensed under the GPL and available for free download at www.talend.com. For larger data integration projects, organizations can tap fee-based enterprise editions of Talend’s integration and data management tools that include additional features aimed at supporting developer teams. Talend has announced plans to follow the 5.0 editions of its open-source tools with updated enterprise editions by the end of the year.
Talend Open Studio in the Lab
I did most of my testing with Talend Open Studio 5.0 RC3 on the 64-bit edition of Fedora 16. My test machines were equipped with 3GB and 4GB of RAM. I don’t recommend using any less, as TOS consumes a good deal of memory.
On my Linux systems, I encountered a problem starting up TOS: the product requires Xulrunner from the Mozilla project, but the 2.0 version of Xulrunner that ships with Fedora wasn’t working. Mozilla offers 32-bit Xulrunner runtimes, but not 64-bit ones, so I compiled a 64-bit version of Xulrunner 1.9.x and specified the path to the runtime in my TOS_DI-linux-gtk-x86_64.ini file.
I installed and fired up TOS on a separate machine running the 64-bit edition of Windows 7 and encountered no such issues running the product.
I noticed that Talend has somewhat streamlined the initial startup process for the product. In previous versions, I’d had to create a repository and user account within which to build my projects. In version 5.0, the tool did away with the repository and user creation step, apparently taking care of this step automatically, behind the scenes.
I was instead prompted to create an account on Talend’s Exchange, a community for sharing custom components among Talend users. TOS now builds a portal to the Exchange, which is also accessible through the Web, into the product’s main interface. I managed to find a useful component for posting updates to Twitter in the Exchange, but I had to visit the Website to download this component, possibly because the component was marked as supporting only 4.x versions of TOS.
For a TOS test case, I set out to automate the reposting of my public updates on Google+ to my Twitter stream, pairing Talend’s JSON data components with the Google+ API, and the Twitter posting component I mentioned above with my Twitter account. In between, I used a Talend tMap component to combine multiple elements from the Google+ stream into the single message for posting to Twitter.
I found it easy to drag data components from the tool’s palettes to a design canvas on which I created data integration jobs. The most challenging part of the process turned out to be parsing the JSON data from Google+, particularly when neighboring posts in my stream included different pieces of data. For instance, posts consisting of a single chunk of text lacked the URL and image attachments of shared story links.
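To make the parsing problem concrete, here is a minimal sketch, written in plain Python rather than as a Talend job, of the kind of logic the job needed: tolerate posts that lack attachments, then fold the pieces into one Twitter-sized message. The sample feed, its field names, the example URLs and the helper names post_to_message and feed_to_messages are all illustrative assumptions, not the actual Google+ payload or anything TOS generates.

```python
import json

TWEET_LIMIT = 140  # Twitter's message limit at the time of this review

# Hypothetical, simplified feed: field names and URLs are assumptions
# chosen for illustration, not a copy of the real Google+ response.
SAMPLE_FEED = """
{"items": [
  {"url": "https://plus.example/post/1",
   "object": {"content": "Just a plain text update."}},
  {"url": "https://plus.example/post/2",
   "object": {"content": "Worth a read",
              "attachments": [{"displayName": "Shared story",
                               "url": "https://news.example/story"}]}}
]}
"""


def post_to_message(post, limit=TWEET_LIMIT):
    """Combine a post's text and one link into a single short message."""
    obj = post.get("object", {})
    text = obj.get("content", "").strip()
    # Text-only posts have no attachments, so fall back to the post's own URL.
    attachments = obj.get("attachments") or []
    attached_url = attachments[0].get("url") if attachments else None
    link = attached_url or post.get("url")
    if not link:
        return text[:limit]
    # Reserve room for the link plus one separating space.
    room = limit - len(link) - 1
    if len(text) > room:
        text = text[:max(room - 3, 0)].rstrip() + "..."
    return f"{text} {link}".strip()


def feed_to_messages(raw_json):
    """Parse the feed and return one message per post."""
    feed = json.loads(raw_json)
    return [post_to_message(post) for post in feed.get("items", [])]


if __name__ == "__main__":
    for message in feed_to_messages(SAMPLE_FEED):
        print(message)
```

The point of the sketch is the defensive lookups: every optional element is read with a default, so a text-only post and a shared link pass through the same path without special-casing.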
One of the newer feature additions to Talend Open Studio is an XML mapping tool that helps users grab the data they want from an XML source, much like the tools for sussing out the structure of CSV or other delimited data types that the product has long included.
In future versions of Talend Open Studio, I’d like to see a similar tool aimed at JSON-formatted input. In recent years, JSON has been overtaking XML in many of the Web service APIs that I encounter, and a stronger set of tools around JSON would be a welcome addition to the product.
During my tests, I worked on my job on both my home and office machines, and found that it was easy to export my in-progress job to an archive file and import it onto my active system. When the time comes to deploy my job, TOS makes it easy to wrap up the job code and any dependencies into a WAR file for deployment on a Java application server.