Jaql: A scripting language for large scale semistructured data analysis

  • Kevin S. Beyer
  • Vuk Ercegovac
  • Rainer Gemulla*
  • Andrey Balmin
  • Mohamed Eltabakh
  • Carl Christian Kanne
  • Fatma Ozcan
  • Eugene J. Shekita

  *Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

194 Citations (Scopus)

Abstract

This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop's MapReduce framework. Jaql is currently used in IBM's InfoSphere BigInsights [5] and Cognos Consumer Insight [9] products. Jaql's design features are: (1) a flexible data model, (2) reusability, (3) varying levels of abstraction, and (4) scalability. Jaql's data model is inspired by JSON and can be used to represent datasets that vary from flat, relational tables to collections of semistructured documents. A Jaql script can start without any schema and evolve over time from a partial to a rigid schema. Reusability is provided through the use of higher-order functions and by packaging related functions into modules. Most Jaql scripts work at a high level of abstraction for concise specification of logical operations (e.g., join), but Jaql's notion of physical transparency also provides a lower level of abstraction if necessary. This allows users to pin down the evaluation plan of a script for greater control or even add new operators. The Jaql compiler automatically rewrites Jaql scripts so they can run in parallel on Hadoop. In addition to describing Jaql's design, we present the results of scale-up experiments on Hadoop running Jaql scripts for intranet data analysis and log processing.
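As a rough illustration of two ideas from the abstract, schema-free JSON-like records and reuse through higher-order functions, the following minimal sketch may help. It is written in Python, not Jaql, and does not reproduce Jaql's syntax or its compiler; the records and the helper names `pipeline`, `keep`, and `total_by` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical JSON-like records: fields differ from record to record,
# so no schema is declared up front.
records = [
    {"user": "a", "bytes": 120, "tags": ["x"]},
    {"user": "b", "bytes": 30},
    {"user": "a", "bytes": 50, "referrer": {"host": "example.org"}},
]

def pipeline(*stages):
    """Higher-order function: compose stages into one reusable function."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

def keep(pred):
    """Return a stage that keeps only the records satisfying pred."""
    return lambda data: [r for r in data if pred(r)]

def total_by(key, value):
    """Return a stage that sums `value` per distinct `key` (a group-by)."""
    def stage(data):
        totals = defaultdict(int)
        for r in data:
            totals[r[key]] += r.get(value, 0)
        return dict(totals)
    return stage

# A reusable "script" assembled from smaller functions.
bytes_per_user = pipeline(
    keep(lambda r: r.get("bytes", 0) >= 50),
    total_by("user", "bytes"),
)

result = bytes_per_user(records)
print(result)
```

The filter stage is independent per record and the totalling stage groups by key, which is the general shape of computation that maps naturally onto a map phase and a reduce phase; how Jaql actually performs that rewriting is described in the paper, not here.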

Original language: English
Pages (from-to): 1272-1283
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 4
Issue number: 12
Publication status: Published - Aug 2011
Externally published: Yes
