Kubernetes with Argo for the Win

Kubernetes is a largely untapped compute resource. It occupies that unfamiliar territory between IaaS and PaaS. Many engineers are apprehensive about using it, IT management is fearful of adding it to the stack, and technologists keep begging for it to be used. But why?

  • To engineers, another abstraction for pipeline code deployment is one more technology the ‘company’ requires them to learn.
  • To management, it is another piece of tech to support, which means more stuff and more debt.
  • To the technologist, those are boneheaded thoughts (yes, every technologist has a colossal ego) — there is no reason to hold…


Open Source Tooling to Move Your Data Warehouse to the Cloud

by John Aven and Jhimli Bora

Choosing the right tools can be difficult when planning a move to the cloud, or any other migration of one or more large data assets. Additionally, the tools chosen to transport data in your new system may not support your current data systems, especially for a migration. The Hashmap team has built an open-source tool aimed at filling that temporary gap.

About

Hashmap Data Migrator, or hdm (yes, lowercase), is open-source software from Hashmap. It is Apache 2.0 licensed. hdm’s goal is to assist in moving data from your…


Working with GitLab

Here at Hashmap, we use GitLab as one of our many version control platforms. Yes, we employ more than one: we are a consulting firm, and as a consulting firm it pays to be familiar with many of the tools available. However, the core of our work is done in GitLab, for some fundamental reasons we will discuss below. That aside, we’d like to talk about the basics of using GitLab and some of its cool features.

The Very Basics

GitLab is, well, at its base, just another Git repository hosting service. As soon as you venture away…


Dockerizing Your Code

Machine learning continues to evolve. While many folks are moving towards more modern approaches, there is still apprehension about using Docker and Docker-related technologies in much of data science. Making the change to a containerized solution like Docker is not necessarily an easy step to take, much like moving your data science workloads to the cloud wasn’t (and maybe you still haven’t taken that step).

Docker, often called a containerization technology, uses an abstraction of virtualization that reuses the host system’s Linux kernel to package and run an application in a platform-independent, immutable deployment unit…


What is a DAG?

In modern computing solutions, the concept of a DAG, or Directed Acyclic Graph, is central. While the term DAG has become quite the buzzword, understanding what DAGs are, how they are used in computing, and how and where they show up in data science and machine learning is not just buzz. In short, a DAG describes the sequence of execution steps in a complex, non-recurring computation.
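As a rough illustration of that definition, here is a minimal sketch in Python. The step names and the `pipeline` mapping are hypothetical, invented only for this example; the standard-library `graphlib` module (Python 3.9+) is used to derive a valid execution order and to reject graphs that contain a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_raw": set(),
    "clean": {"load_raw"},
    "build_features": {"clean"},
    "train_model": {"build_features"},
    "evaluate": {"train_model", "clean"},
}


def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """Return True if the graph has no cycles, i.e. it really is a DAG."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(pipeline)
print(order)
```

Every step appears after the steps it depends on, which is the "directed" and "acyclic" part doing its job: a graph in which two steps depend on each other has no valid order, and `is_acyclic` returns `False` for it.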

How often do you come across the need to create a DAG in machine learning?

Every. Single. Day.

Machine learning can be constructively defined to be “the art of building DAGs to treat and transform…


And You Should Too

Whether you are a data scientist, an ML engineer, a data analytics manager, a compliance officer, or the CEO (or anyone else whose job depends on it, or eventually will), you need to know what MLflow is and why you should be using it.

When a data scientist builds a model, they go through a lot of experimentation; some even intentionally apply the scientific method. But unlike most scientists — or rather, exactly like most scientists — their record-keeping habits are generally shameful. I can tell you this from experience. When it comes to any computational science, these bad habits lead…


#3 in the Evolving Data Science Series

Welcome back to the Evolving Data Science series. If you haven’t already read the previous entries, please find them below and read them at your leisure. There is no ‘ordering’ in this series. Today we want to talk about how you as a company, as agents of change, can scale up your data science processes.

Code Repo Variance

Some of the inherent complexity found in data science comes from the number of different steps that can be taken in any given solution. This generally results in code repositories that vary widely. These code repositories can…


Hadoop is Not a Cloud Data Warehouse

Spark has been the answer to processing all big data in memory, but is that still true? In the world of big data, Spark was the data engineer’s answer to the needs that Hadoop created. Loading large amounts of data into memory and performing SQL and dataflow-type operations on that data was a major requirement. Spark’s predecessor, MapReduce, required data to be read from and written to disk many times.

However, as Snowflake has taken on a much larger part of the overall data and analytics market, we need to reevaluate the place of Spark in the stack.

Is Spark a Long-Term Solution?


That Wasteful ‘Parquet’ Layer

Those of us working within the big data space have been trained to build out various layers within a ‘data lake’. These layers, usually three to five, typically consist of the following (common alternate naming conventions included):

  1. Raw/Landing/Stage
  2. Standardized/Normalized
  3. Cleansed/Prepared
  4. Modeled/Provisioned
  5. Published/Reporting/Consumption/BI

Occasionally, some of these layers get combined, depending on the architect and the business requirements. Within a data lake, these layers have a purpose: they correspond to the various stages you find in data warehousing solutions. They served that purpose in the data lake space, but do they make sense when working with a Cloud Data Warehouse…


Addressing Realtime Data Sources

StreamSets is one of the friendlier EL (that is, data acquisition) tools to use. It has a friendly and intuitive user interface. This article is not here to discuss the pros and cons of StreamSets or any tool in its class. Instead, we are going to consider its use as a vehicle for moving data from a remote data source and placing it in Snowflake, the dominant Cloud Data Warehouse (CDW), and for good reason. …

John Aven

“I’d like to join your posse, boys, but first I’m gonna sing a little song.”
