Preface

Data science is a diverse and growing field encompassing many subfields of both mathematics and computer science. Statistics, linear algebra, databases, machine intelligence, and data visualization are just a few of the topics that merge in the work of a data scientist. Technology abounds, and the tools for practicing data science are evolving rapidly. This book focuses on core, fundamental principles backed by clear, object-oriented code in Java. And while this book will inspire you to get busy right away practicing the craft of data science, it is my hope that you will take the lead in building the next generation of data science technology.

Who Should Read This Book

This book is for scientists and engineers already familiar with the concepts of application development who want to jump headfirst into data science. The topics covered here will walk you through the data science pipeline, explaining mathematical theory and giving code examples along the way. This book is the perfect jumping-off point into much deeper waters.

Why I Wrote This Book

I wrote this book to start a movement. As data science skyrockets to stardom, fueled by R and Python, very few practitioners venture into the world of Java. Clearly, the tools for data exploration lend themselves to interpreted languages. But there is another realm of the engineering–science hybrid where scale, robustness, and convenience must merge. Java is perhaps the one language that can do it all. If this book inspires you, I hope that you will contribute code to one of the many open source Java projects that support data science.

A Word on Data Science Today

Data science is continually changing, not only in scope but also in those practicing it. Technology moves very fast, with top algorithms moving in and out of favor in a matter of years or even months. Long-standing standard practices are discarded for practical solutions. And the barrier to success is regularly hurdled by those in fields previously untouched by quantitative science. Data science is already taught as an undergraduate curriculum. There is only one way to be successful in the future: know the math, know the code, and know the subject matter.

Navigating This Book

This book is a logical journey through a data science pipeline. Chapter 1 examines the many methods for getting, cleaning, and arranging data into its purest form, along with the basics of writing data to files and plotting. Chapter 2 addresses the important concept of viewing our data as a matrix and presents an exhaustive review of matrix operations. With data in hand and its structure decided, Chapter 3 introduces the basic concepts that allow us to test the origin and validity of our data. In Chapter 4, we directly use the concepts from Chapters 2 and 3 to transform our data into stable and usable numerical values. Chapter 5 contains a few useful supervised and unsupervised learning algorithms, as well as methods for evaluating their success. Chapter 6 provides a quick guide to getting up and running with MapReduce by using customized components suitable for data science algorithms. A few useful datasets are described in Appendix A.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Caution

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/oreillymedia/Data_Science_with_Java.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science with Java by Michael Brzustowicz (O’Reilly). Copyright 2017 Michael Brzustowicz, 978-1-491-93411-1.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Safari

Note

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

I would like to thank the book’s editors at O’Reilly, Nan Barber and Brian Foster, for their continual encouragement and guidance throughout this process.

I am also grateful to the staff at O’Reilly: Melanie Yarbrough, Kristen Brown, Sharon Wilkey, Jennie Kimmel, Allison Gillespie, Laurel Ruma, Seana McInerney, Rita Scordamalgia, Chris Olson, and Michelle Gilliland, all of whom contributed to getting this book in print.

This book benefited from the many technical comments and affirmations of colleagues Dustin Garvey, Jamil Abou-Saleh, David Uminsky, and Terence Parr. I am truly thankful for all of your help.
