Pig Design Patterns

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The examples in this book have been tested against Pig version 0.11.0. Many of the Pig scripts, UDFs, and datasets are available from the publisher's website or GitHub.

The code is also available at https://github.com/pradeep-pasupuleti/pig-design-patterns.

The Pig Latin script examples are organized by chapter in their respective directories. Java and Python UDFs are also part of each chapter directory, organized in a separate subdirectory named src. All datasets are in the datasets directory. Readme files are included to help you build the UDFs and understand the contents of the data files.
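For reference, the following is a minimal sketch of how a built UDF can be registered in a Pig script; the jar name, script name, class name, function names, and paths below are hypothetical placeholders, not files from the repository:

    -- Register a Java UDF jar built from a chapter's src directory
    -- (the jar name and class name are hypothetical placeholders)
    REGISTER 'chapter-udfs.jar';
    DEFINE MyEval com.example.pig.MyEvalFunc();

    -- Register a Python UDF script through the Jython engine
    -- (the script name and function name are also placeholders)
    REGISTER 'my_udfs.py' USING jython AS my_udfs;

    lines = LOAD '/user/hadoop/input/sample.txt' AS (line:chararray);
    out   = FOREACH lines GENERATE MyEval(line), my_udfs.normalize(line);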

Each script is written with the assumption that its input and output reside in HDFS.
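As a minimal sketch (the HDFS paths, schema, and delimiter below are illustrative assumptions, not the book's actual dataset locations), a typical script loads its input from HDFS and stores its output back to HDFS:

    -- Copy a local dataset to HDFS first, for example from the Grunt shell:
    --   fs -copyFromLocal datasets/sample_input.csv /user/hadoop/datasets/

    -- Load from an HDFS input path (path and schema are illustrative)
    raw = LOAD '/user/hadoop/datasets/sample_input.csv'
          USING PigStorage(',') AS (id:int, value:chararray);

    filtered = FILTER raw BY id IS NOT NULL;

    -- Store the result to an HDFS output path (the directory must not already exist)
    STORE filtered INTO '/user/hadoop/output/sample_output' USING PigStorage(',');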

Third-party libraries

A number of third-party libraries are used for the sake of convenience. They are included in the Maven dependencies, so no extra work is required to use them. The following table lists the libraries in prevalent use throughout the code examples:

Datasets

Throughout this book, you'll work with these datasets to provide some variety for the examples. Copies of the exact data used are available in the GitHub repository at https://github.com/pradeep-pasupuleti/pig-design-patterns. Wherever relevant, data specific to a chapter is located in chapter-specific subdirectories under the same GitHub location.

The following are the major classifications of datasets used in this book, as relevant to the use cases discussed:

  • The logs dataset contains a month's worth of HTTP requests to the NASA Kennedy Space Center WWW server in Florida. These logs are in the Apache access log format (a minimal loading sketch follows this list).

    Note

    The dataset is downloaded from the links ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz.

    Acknowledgement: The logs were collected by Jim Dumoulin of the Kennedy Space Center, and contributed by Martin Arlitt and Carey Williamson of the University of Saskatchewan.

  • The custom logs dataset contains logs generated by a web application in the custom log format. Web service request and response information is embedded along with the event logs. This is a synthetic dataset created specifically to illustrate the examples in this book.
  • The historical NASDAQ stock dataset covers 1970 to 2010 and includes daily open, close, low, high, and trading volume figures. Data is organized alphabetically by ticker symbol.
  • The customer retail transactions dataset has details on category of the product being purchased and customer demographic information. This is a synthetic dataset created specifically to illustrate the examples in this book.
  • The automobile insurance claims dataset consists of two files. The automobile_policy_master.csv file contains the vehicle price and the premium paid for it. The automobile_insurance_claims.csv file contains automobile insurance claims data, specifically claims for vehicle repair charges. This is a synthetic dataset created specifically to illustrate the examples in this book.
  • The MedlinePlus health topic XML files contain records of health topics. Each health topic record includes data elements associated with that topic.
  • The Enron e-mail dataset contains a large set of messages from the Enron corpus, which covers about 150 users with an average of 757 messages per user. The dataset is distributed in Avro format, and we have converted it to JSON format for the purpose of this book.

    Note

    This dataset is downloaded from the link https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro.

  • The manufacturing dataset for electrical appliances is a synthetic dataset created for the purpose of this book. It contains the following files:
    • manufacturing_units.csv: This contains information about each manufacturing unit
    • products.csv: This contains details of the products that are manufactured
    • manufacturing_units_products.csv: This holds detailed information of products that are manufactured in different manufacturing units
    • production.csv: This holds the production details
  • The unstructured text dataset contains parts of Wikipedia articles on computer science, information technology, big data, medicine, and the invention of the telephone, as well as a stop-word list and a dictionary word list.
  • The Outlook contacts dataset is a synthetic dataset created by exporting Outlook contacts for the purpose of this book; it is a CSV file with contact name and job title attributes.
  • The German credit dataset in CSV format classifies people as good or bad credit risks based on a set of attributes. There are 20 attributes (7 numerical and 13 categorical) with 1,000 instances.

    Note

    This dataset is downloaded from the link http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data.

    Acknowledgement: Data collected from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)), source: Professor Dr. Hans Hofmann, Institut fuer Statistik und Oekonometrie, Universitaet Hamburg.
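As a rough illustration of how one of these datasets can be read, the following sketch loads the NASA access logs and splits each line into its Apache Common Log Format fields using the built-in REGEX_EXTRACT_ALL function; the HDFS path, relation names, and the final aggregation are assumptions made for this example, not taken from the book's scripts:

    -- Load the NASA access logs (Apache Common Log Format); the HDFS path is illustrative
    raw_logs = LOAD '/user/hadoop/datasets/NASA_access_log_Jul95' AS (line:chararray);

    -- Extract host, timestamp, request, status, and bytes from each line
    parsed = FOREACH raw_logs GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(line,
            '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "([^"]*)" (\\d{3}) (\\S+)'))
        AS (host:chararray, ts:chararray, request:chararray,
            status:chararray, bytes:chararray);

    -- A simple aggregation: count requests per host
    requests_per_host = FOREACH (GROUP parsed BY host)
                        GENERATE group AS host, COUNT(parsed) AS hits;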

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us if you are having a problem with any aspect of the book, and we will do our best to address it.