Alex Rodrigues

Data science, distributed systems and big data

Probe Your System With Synthesized Realistic Data

I wonder how ancient civilizations, that have built incredible structures, knew that once built they would remain steady. I’m not talking about mysterious pyramids but more recent landmarks left us by romans such as roman aqueducts. How sure were they on the resiliency to nature powers, strong winds and tough winters? Well… the answer seems to be easy: with lots of theoretic work, designs and calculations.

While this is true, things were also achieved with lots of experimentation, both on the structural side and on materials applications. That knowledge was vital to have good practices on what to use in each situation and predict maintenance and consequences in extreme conditions. The same applies to the conception of fault-tolerant reliable systems to process large volumes of data.

Normal approaches start with mere reasonable assumptions on data volume, ingestion rate, document size, but others are hard to predict such as processing time, indexing time, etc. It really helps to know how components behave under pressure and calculate what are the limits of the current design.

Many (self-entitled) architects like to take the designing process as cooking a big blend of hype-based technologies and it’s very easy to get burnt. Key factors such as SLAs, peak times and hardware limitations greatly affect which components you choose to put in the pan and how would you mix them together.

The only way to know how certain software components behave, they have to be exercised with a great volume of data, similar to the one they will process once in production. Data that can be just fed from the productions servers if it exists and if the infrastructure allows. More often than not, the data is not yet available in the target formats and there’s the necessity of trying out with generated data in the chosen formats.

Whenever I want to test and benchmark the systems I’m working on, I use a synthetic data generator called log-synth. Log synth is one of those swiss-army knives for data generation. It has plenty of generators, based on parametric statistical methods. The good part is that it’s open-source and can be easily extended with new generating algorithms and output formats.

The most common output formats are JSON and CSV.

Recently I’ve added a template based generator that extends the range of available output formats. It leverages the data generation algorithms that ship with it and feeds that into a templated document using Freemarker templating language.

To generate a sample vCard simply create a file template.txt:

template.txt – template for a vCard document.
1
2
3
4
5
6
7
8
9
10
11
BEGIN:VCARD
VERSION:3.0
N:${last_name.asText()};${first_name.asText()};;${title.asText()}
ORG:Sample Org
TITLE:${title.asText()}
PHOTO;VALUE=URL;TYPE=GIF:http://thumbs.example.com/${filename.asText()}/${first_name.asText()?lower_case  }.gif
TEL;TYPE=HOME,VOICE:${phone_number.asText()}
ADR;TYPE=WORK:;;${address.asText()?split(" ")?join(";")}
EMAIL;TYPE=PREF,INTERNET:${first_name.asText()[0]?lower_case}${last_name.asText()?lower_case}@example.com
REV:${first_visit.asText()}
END:VCARD

Then a schema file, let’s call it schema.txt:

schema.txt – schema for vCard document fields.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
[
    {"name":"title", "class":"string", "dist":{"Mr":0.5, "Mrs.":0.14, "Miss":0.36}},
    {"name":"first_name", "class":"name", "type":"first"},
    {"name":"last_name", "class":"name", "type":"last"},

  {"name": "filename", "class": "join", "separator": "/", "value": {
          "class":"sequence",
          "length":2,
          "array":[
              {"class":"string", "dist":{"small":10, "medium":5, "large":2}},
              {"class":"string", "dist":{"high":10, "low":5, "mobile":15}}
          ]
      }
  },

  {"name": "phone_number", "class": "join", "separator": "-", "value": {
          "class":"sequence",
          "length":3,
          "array":[
              { "class": "int", "min": 100, "max": 999},
              { "class": "int", "min": 100, "max": 999},
              { "class": "int", "min": 100, "max": 999}
          ]
      }
  },

    {"name":"address", "class":"address"},
    {"name":"first_visit", "class":"date", "format":"yyyy-MM-dd HH:mm:ssZ"}
]

To invoke the log-synth, just do:

java -cp .:./target/log-synth-0.1-SNAPSHOT-jar-with-dependencies.jar com.mapr.synth.Synth -count 5000 -schema schema.txt -template template.txt -format TEMPLATE -output output/

The output documents will end up in the output/ folder as expected and they will look like:

Sample generated vCard document.
1
2
3
4
5
6
7
8
9
10
11
BEGIN:VCARD
VERSION:3.0
N:Kittle;Gwendolyn;;Mr
ORG:Sample Org
TITLE:Mr
PHOTO;VALUE=URL;TYPE=GIF:http://thumbs.example.com/small/mobile/gwendolyn.gif
TEL;TYPE=HOME,VOICE:774-383-580
ADR;TYPE=WORK:;;18033;Quaking;Brook;Avenue
EMAIL;TYPE=PREF,INTERNET:[email protected]
REV:2013-07-14 01:37:08+0100
END:VCARDBEGIN:VCARD

I invite you to explore this very handy tool on Github.

Comments