Chapter 7. A Detailed Look at log-synth

Log-synth is open source software for generating synthetic data that mimics the characteristics of real data, which is especially useful when access to sensitive data is restricted. This chapter is a detailed technical description of the purpose and implementation of log-synth, and it should be read as a how-to guide rather than a conceptual discussion. As such, it overlaps somewhat with the technical descriptions of the specific use cases covered in Chapter 5 and Chapter 6, but it also goes beyond those examples.

For convenience, log-synth is freely available in the following GitHub repository, which also contains prepackaged samplers and some documentation: https://github.com/tdunning/log-synth.

Goals

As a package, log-synth has fairly simple goals:

  • Facilitate the creation of realistic random data by non-specialists

  • Be fast enough to generate big data–scale datasets quickly

  • Allow schemas to be defined that combine various building blocks flexibly

  • Make it easy to extend log-synth with new samplers

  • Keep the system and the user experience really simple
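
To make the idea of a schema built from combinable samplers more concrete, here is a minimal Python sketch. It is not log-synth's code or API: the sampler classes ("int", "choice"), the dictionary keys, and all function names are hypothetical, chosen only to illustrate how a small registry of building blocks can be composed by a declarative schema and extended with new samplers.

```python
import random

# Hypothetical building blocks (not log-synth's API): each factory takes a
# field specification and returns a function that draws one value from an RNG.
def make_int_sampler(spec):
    lo, hi = spec.get("min", 0), spec.get("max", 100)
    return lambda rng: rng.randint(lo, hi)  # inclusive on both ends

def make_choice_sampler(spec):
    values = list(spec["values"])
    return lambda rng: rng.choice(values)

# Registry of available samplers; adding an entry extends the system.
SAMPLER_FACTORIES = {
    "int": make_int_sampler,
    "choice": make_choice_sampler,
}

def compile_schema(schema):
    """Turn a list of field specs into (field name, sampler) pairs."""
    return [(spec["name"], SAMPLER_FACTORIES[spec["class"]](spec))
            for spec in schema]

def generate(schema, count, seed=0):
    """Generate `count` records, one value per schema field."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    fields = compile_schema(schema)
    return [{name: sample(rng) for name, sample in fields}
            for _ in range(count)]

# An illustrative schema combining two building blocks.
schema = [
    {"name": "age", "class": "int", "min": 18, "max": 90},
    {"name": "plan", "class": "choice", "values": ["free", "pro"]},
]
rows = generate(schema, 5, seed=42)
```

In this sketch, supporting a new kind of field only requires registering one more factory in `SAMPLER_FACTORIES`; the schema and the generation loop stay unchanged.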

To meet these goals, log-synth takes a minimalist approach to overhead and structure but a very generous one toward the variety of built-in samplers. These goals have meant that while log-synth contains a wide variety of primitive generators for things like names, addresses, ...
