The Data Janitor Letters - June 2020

Data engineering salon. News and interesting reads about the world of data.

I gave the business what they asked for and they never used it
Kenny Ning, Data Engineer, Better

Another entry in the long list of unused ML projects, with three lessons: start with the simplest solution, modeling is never done, and failure is okay.

Getting machine learning to production
Vicki Boykis, Machine Learning Engineer, Automattic

Deploying is hard. Deep learning is deceptively easy. Go for prebuilt as much as possible. Understand networking and scale. Iterate quickly.

AI – the no bullshit approach
Filip Piekniewski, Scientist, Accel Robotics

Common sense will emerge only when a connectionist-like system has a chance to develop the internal symbols to represent relationships in the physical world.

What I learned from looking at 200 machine learning tools
Chip Huyen, Engineer, Snorkel AI

If you have to choose between engineering and ML, choose engineering. It’s easier for great engineers to pick up ML knowledge, but it’s a lot harder for ML experts to become great engineers.

Most tech content is bullshit
Aleksandra Sikora, Software Engineer, Hasura

Realize that there are tons of misconceptions in the world. Adapt solutions to your particular use case. Your solutions are not any worse than the ones on the internet.

How we migrated our data warehouse from Redshift to BigQuery
Rahul Jain, Principal Engineering Manager, Data engineering and BI platform, Omio/GoEuro

A step-by-step account of the migration.

In search of speed — debugging Elasticsearch performance
Martin Iotchev, Fullstack Software Engineer, Prezi

The underlying hardware plays a significant role in the performance of an Elasticsearch cluster. Provisioning larger data nodes will yield better performance as compared to the smaller default nodes currently used in production. Furthermore, a cluster with more shards will perform better on larger data sets.

Lightweight alternatives to Google Analytics
Ben Hoyt, Software Engineer, Compass

For site owners who just need basic traffic numbers, GoatCounter and Plausible both seem like excellent options. Those who like more visual polish and documentation might prefer Plausible; those who value a more developer-friendly tool with easy self-hosting will probably prefer GoatCounter.