Tag: Data

How I Got Into Data Engineering

In the world of data, we often hear tales of high-flying graduates with degrees in Computer Science landing impressive gigs at tech giants. But let’s flip the script. Here I am, a Data Engineer who found his purpose in a health-tech company, with a somewhat unconventional pathway into the fiel...

ChatGPT’s Code Interpreter Was Just Released. Here’s How It Will Change Data Science Forever.

The best ChatGPT plugin was just released — code interpreter. This plugin allows us to upload data, write/execute Python code, do data analysis, generate reports, and even download the code and reports generated in .ipynb and pdf format respectively. The best part? All this is done in secon...

A Data Science Project with ChatGPT Code Interpreter

As someone who is currently juggling a full-time data science job with multiple freelance projects, I am usually the first to try tools that can potentially decrease my turnaround time. When ChatGPT started rolling out the Code Interpreter plugin to subscribers in the past week, I couldn’t ...

ChatGPT used for Financial Data — Filters, Plots, and a Streamlit App

As a former portfolio manager, I always thought there had to be a better way to store and compare financial data other than Excel spreadsheets. Long hours of maintaining single company models and copying/pasting/index/matching data between my models for comp screens and such led me to long for somet...

How to Create Valuable Data Tests

Data quality has been widely discussed over the past year. The increasing adoption of data contracts, data products, and data observability tools certainly shows data practitioners’ commitment to providing high-quality data to their consumers. We all love to see this! One essential building...

Demystifying Data Science: Part 1

Introduction Data science is a dynamic and expansive field that has revolutionized industries across the globe. In this article, we will embark on a journey to demystify data science by providing an overview of its key concepts and the underlying process. By understanding the fundamental concepts...

Master Data Analysis using Trustworthy Databases

Python is an excellent programming language for beginners to handle and analyze data. We’ll also talk about how to find relevant data sources and a bit about statistical testing. Let’s break this down into a few steps: Step 1: Finding Relevant and Trustworthy Databases Depending on...

What Exactly Does a Data Scientist Do?

As this smorgasbord of job descriptions shows, it can be really hard to get a clear picture on what a Data Scientist role actually involves day-to-day. Lots of the existing articles out there — while excellent — date from 2012–2020, and in a field that evolves as fast as Data ...

Harnessing the Power of Knowledge Graphs: Enriching an LLM with Structured Data

In recent years, large language models (LLMs) have become ubiquitous. Perhaps the most famous LLM is ChatGPT, which was released by OpenAI in November 2022. ChatGPT is able to generate ideas, give personalized recommendations, understand complicated topics, act as a wr...

Data Mastery with Python and SQL: Unleashing Efficiency and Security through 4 Strategic Use Cases

Introduction Data analysis and management are essential components of any modern enterprise’s operations. To effectively harness the power of data, professionals rely on a combination of programming languages and tools that enable efficient data processing, manipulation, and analysis. In th...

Why Data Scientists and Engineers Quit Their Jobs

A recent study of data scientists made the following quite stark conclusion: “The typical data scientist works for a large tech firm — where they have been employed for roughly one year with an average of 6.2 years of prior experience in the field. Notably, they have switched c...

My Data Analyst Interview at a Finance Company: 6 Questions and Answers.

Interview Experience for a Data Analyst Role at a Finance Sector Company I recently went through an interview process for a data analyst role at a finance sector company, and I would like to share my experience with you. The interview consisted of several technical questions which were quite easy...

3 Ethical Dilemmas in Data Science You’ve Likely Overlooked

“As I trudged through the massive data swamp, an ethereal voice whispered to me, ‘With great data, comes great responsibility.’” Yes, my friends, it was no other than the Ghost of the Internet Future (yes, she’s a thing), forewarning about the immense ethical complex...

Different Ways of Storing Data in the Browser

Web applications often require some form of data storage to function effectively. In some cases, this data needs to persist between different user sessions, while at other times it needs to be kept only while the page is loaded. Several options are available for storing data directly in the user’...

Python in Excel Will Reshape How Data Analysts Work

As Microsoft said, this is a significant evolution in the analytical capabilities available within Excel. They want to combine the power of Python with the flexibility of Excel. The best of both worlds! With this integration, you can write Python code in Excel cells, create advanced visualization...

How to Create First Data Engineering Project? An Incremental Project Roadmap

People often fail to remain consistent while learning so many different technologies. They fail to piece it together. The roadmap should have addressed the challenge of this dwindling attention span. We will try to explore how an Incremental Project Roadmap can address this gap. The Data Engineer...

What you won’t learn from books about data and decision-making

My community has been asking me for a reading list of my favorite books about decision-making, data science, and decision intelligence, so here are the fruits of my attempt to compile some recommendations for you. Photo by the author. I’d love to suggest one great book for ...

How I Would Learn Data Science with ChatGPT If I Could Start Over

Learning how to learn is one of the most useful skills you can cultivate. When I first started teaching myself programming and data science in 2018, I enrolled into countless online courses. Every time I completed a course and got a certificate, I’d get a momentary feeling of accomplishment...

10 Data Analysis Projects to land your first Job.

To land a job in a highly technical field such as data analysis, you need a solid portfolio. This portfolio will be your identity card, the way you introduce yourself to the world. So here are 10 entry-level project ideas you can use to test your limits and skills, wher...

How to Quietly Upgrade Yourself as a Data Engineer (while Working 9 to 5)

When I began my data journey, I thought it would be sunshine and rainbows all the way to the top. I believed (naively) everything would make sense, things would be easy, and all the puzzle pieces would fit perfectly in place as the years went by. The data world doesn’t work that way. There ...

Introduction to Data Version Control

What is Data Version Control (DVC)? Any production-level system requires some kind of versioning. A single source of current truth. Any resources that are continuously updated, especially simultaneously by multiple users, require some kind of an audit trail to keep track of all changes....

How to Run Llama 2 on Mac M1 and Train with Your Own Data

Llama 2 is the next generation of large language model (LLM) developed and released by Meta, a leading AI research company. It is pretrained on 2 trillion tokens of public data and is designed to enable developers and organizations to build generative AI-powered tools and experiences. Llam...

A Very Dangerous Data Science Article

A few years ago an article was published in Harvard Business Review online called Prioritize Which Data Skills Your Company Needs with This 2×2 Matrix. Now, HBR is well known for its articles on business strategy, but it is not really a market leader in technical content, and this ar...

The building blocks of Faire’s Data Team onboarding

As Faire’s marketplace has experienced incredible growth over the last few years, so has the size and scope of our Data team. As we’ve delivered more impact by building additional machine learning models, running more experiments, and evolving our tooling and infrastructur...

Registering refugees using personal information has become the norm — but cybersecurity breaches pose risks to people giving sensitive biometric data

The number of refugees worldwide reached record high levels in 2022. More than 108.4 million people have been forced to flee their homes because of violence or persecution. Meanwhile, governments and aid agencies are increasingly using a controversial method of effectiv...

Langchain 101: Extract structured data (JSON)

Based on Medium’s new policies, I am going to start a series of short articles that deal only with practical aspects of various LLM-related software. Photo by Marga Santoso on Unsplash The Tutorial In this tutorial, we will learn how to extract structured d...

Writers Once Had a Muse, Now They Have AI Tools

Whether we realize it or not, we spend all of our time gathering and processing data. Some of the data is good and actionable. Some of the data is misleading. It’s our job to make the best choices we can based on the information we have. When it comes to digital writing tools, the advanceme...

Data: A Hoarder’s Storage Locker, Not a Magical Museum

There’s a common misconception that data is the next best thing to a holy relic of science — objective, mathematical, clean, correct, and above all, always useful. A more accurate analogy for data would be a hoarder’s storage locker. If you’re like most people, you ...

Data Objects in Kotlin

Data objects are a new Kotlin language feature introduced in version 1.7.20 and are currently planned to be released in version 1.9. We’ll take a closer look at what they are and what issue they are trying to solve. What issue are data objects solving? Below we have a typi...

Best Data Analysis Library in Python

Imagine you have a bunch of data in Jupyter Notebook that you want to analyze and visualize. PyGWalker is like a magic tool that makes it super easy for you. It takes your data and turns it into a special kind of table that you can interact with, just like using Tableau. You can explore yo...

Using Retrofit to upload file and some data on server in Android App.

During my recent project, I utilized the Retrofit library in my App to efficiently handle file uploads alongside additional data fields. While implementing this functionality, I encountered some challenges that I want to share with you. I would like to begin with Retrofit library, Retrofit is a p...

Understanding Entropy made me a better data scientist

I remember several years ago when I was reshaping my career from finance into data science and being fascinated about how the book Data Science for Business (Provost & Fawcett) introduced the concept of Entropy in their classification examples, so elegantly, so powerful yet s...

Inspecting Data Science Predictions: Individual + Negative Case Analysis

Somewhere around 40 to 43% of the time when I am showing new learners how to use the .predict() methods I get the following question: Where are the predictions? I wish this was a question learners would ask more often. It is an insightful question, especially for folks who are newer ...

12 Python Features Every Data Scientist Should Know

As a data scientist, you’re no stranger to the power of Python. From data wrangling to machine learning, Python has become the de facto language for data science. But are you taking advantage of all the features that Python has to offer? In this article, we’ll take a deep dive into...

Stop using Data Transfer Objects (DTOs) in your code

Goals and objectives vary a lot depending on your organization's goals and constraints. When I was a consultant, I was essentially “outsourced” to help with a particular project for a short period of time. You get exposed to very diverse projects and different ways of doing things. ...

Why Data Quality Is Harder than Code Quality

As a data engineer, I always feel less confident about the quality of data I handle than the quality of code I write. Code, at least, I can run it interactively and write tests before deploying to production. Data, I most often have to wait for it to flow through the system and be used to encounter ...

Cracking the Code of Business Success: How Data Analysis is the Ultimate Productivity Potion

In the ever-evolving landscape of business, the pursuit of success has taken on new dimensions. Gone are the days when sheer hard work alone could guarantee prosperity. Today, a smarter approach is essential — one that leverages the power of data analysis to unearth insights, drive strategic d...

Three Simple Things About Regression That Every Data Scientist Should Know

I consider myself more of a mathematician than a data scientist. I can’t bring myself to execute methods blindly, with no understanding of what’s going on under the hood. I have to get deep into the math to trust the results. That’s a good thing because it’s very easy nowaday...

World Class Data Scientist Says This Cycle Is Your Last Chance To Make Generational Wealth From Crypto (While It’s New and Inefficient)

He’s a data scientist who started developing games and websites at the age of nine, thanks to the support from his mathematician mom. Like many of us, his most significant lesson in how the market worked was FOMO’ing (Fear of Missing Out) at the top of a bull run and then getting wipe...

Real-time Data Synchronization

In today’s microservices architecture, event-driven communication has transitioned from being a luxury to a necessity. The traditional approach of using point-to-point communication between microservices is no longer sufficient to meet the evolving demands of modern applications. Event-driven ...

Real-Time Data Synchronization in ASP.NET Core Using the Outbox Pattern

Real-time data synchronization has become a critical requirement for modern web applications. In this comprehensive guide, we’ll explore real-time data synchronization using the Outbox Pattern in ASP.NET Core. From understanding the fundamentals to implementing practical example...

This once hot data trend just reared its ugly head again — strap on in

If you ever wondered why the data space has so much noise going on in it, you need look no further than the millions of dollars put into the industry. This is, in some ways, a sign of the times. As well as seeing VC spend on data products increase, the spend on those products themselves has increased t...

5 Things I Wish I Knew Before My First Job as a Data Analyst

I’m five months into my first Data Analyst job, and there are many surprises I wish someone had warned me about! I will list them below, so you are better prepared than I was. 5 things I wish I knew before starting my first Data Analyst job 1. If you’re the only tech person on the ...

DATA SCIENCE : A 30 day learning experience

From when I started teaching students in 10th grade to coming to college and making friends through teaching and sharing my insights on the subject, my love for sharing my experience grew tenfold, and it made my process of learning a new skill or a new subject more fun and made it possible for ...

Geospatial Data Engineering: Spatial Indexing

Intro: why is a spatial index useful? In doing geospatial data science work, it is very important to think about optimizing the code you are writing. How can you make datasets with hundreds of millions of rows aggregate or join faster? This is where concepts such as spatial indices come in. In th...

Automate the exploratory data analysis (EDA) to understand the data faster and easier

What is EDA? EDA is one of the most important things we need to do as an approach to understand the dataset better. Almost all data analytics or data science professionals do this process before generating insights or doing data modeling. In real life, this process took a lot of time, depend...

Great Applied (Data) Science Work

Advanced data science work in industry is sometimes also known as “applied science,” reflecting the reality that it’s about more than just data and that many former academics work in the field. I find that “applied science” has different expectations than research scien...

Learn Next.js By Building Your First Next.js App From Scratch

While this post will help you build a Next.js app from scratch, you should have some knowledge of HTML, CSS, JavaScript, React, and related web development concepts before jumping into it. Second part Build A Weather App On Next.js Workshop If you’d like to join a fr...

Data Visualisation Mind-Map

In the ‘Marks.csv’ file, you can find the scores obtained by 200 students in 4 subjects of a standardised test. The different columns — Score A, Score B, Score C and Score D indicate the score obtained by a particular student in the respectiv...

Archetypes of the Data Scientist Role

After the positive responses to my recent post in Towards Data Science about Machine Learning Engineers, I thought I would write a bit about what I think are the real categories of roles for data science practitioners in the job market. While I was previously talking about the candidates, e.g. ...

Fetch Data in React JS in modern way

There are several ways to fetch data in a React application. Here are some of the most common: Fetch API : The Fetch API is a built-in browser API for fetching resources, including data from a server. It returns a Promise that resolves to the response object. You can use the Fetch API in combi...

Web Storage API: A Developer’s Guide to Browser Data Storage

When we see HTML, we see a bunch of tags, elements, and lots of angle brackets but HTML is much more than these things. In recent years of web development, it has emerged as a cornerstone for creating interactive and dynamic web applications. At the root of this innovation lie HTML5 APIs, which prov...

Should We Be More Data-Driven? Sometimes.

I was working as a data scientist at Airbnb when Covid-19 struck. And as you might expect, Covid-19 was a special kind of brutal for a business that relied on good faith human-to-human interaction. When the world is forming insular social pods, it’s going to be hard to get anyone to stay ...

ChatGPT for Data Analysts (Part 1)

OpenAI launched ChatGPT almost a year ago on November 30, 2022. Being a fan of artificial intelligence, I immediately started experimenting with this conversational agent, which is based on the latest GPT (Generative Pretrained Transformer) model. Just like millions of other users, I was quickly ...

7 Books to Be the Top Data Engineer

In today’s data-driven world, data engineering plays a pivotal role in transforming raw data into actionable insights. Aspiring data engineers often seek guidance and knowledge to master the essential skills required for success. While online resources and courses are abundant, the power of a ...

Data Analyst Interview Experience

This was my first time giving a data analyst interview. It was expected to be around 30 minutes but extended up to 1 hour. (PS — I am a third-year computer science undergrad). So here it goes! Starting off with my introduction, the interviewer asked me about all the subjects I have i...

Create a GraphQL Query With a REST Endpoint As a Data Source

In this short article, let’s look at how we could create a simple graphQL query that fetches data from a RESTful endpoint. We’ll use Node.js and Apollo Server for this purpose. In this article, we’ll look at a simple Users resource as an example. The aim of this write-...

How To Manipulate Data

A large part of our modern society has cult-like faith in data. If you show a fancy-looking pie chart or a colorful scatter plot, people will find your statements three to five times more trustworthy (source). Ask anyone who has worked in a big corporation if you don’t believe me; they will...

How to Build a 5-Layer Data Stack

Like bean dip and ogres, layers are the building blocks of the modern data stack. Its powerful selection of tooling components combine to create a single synchronized and extensible data platform with each layer serving a unique function of the data pipeline. Unlike ogres, however, the cl...

Mastering the Art of Pricing Optimization — A Data Science Solution

1. Overview Pricing plays a very crucial role in the world of business. Making a balance between sales and margins is very important for the success of any business. How can we do it in the data science way? In this section, we will build the intuition of an effective data science solution for pr...

Seaborn charts that every Data Scientist Knows!

Data isn’t just numbers, but a narrative waiting to be uncovered. The world of data science has witnessed an astonishing surge, and I’ve had a front-row seat to this remarkable journey. Seaborn is a Python library that allows us to plot graphs and plots that help us extra...

10 Things I Learned from Reading Fundamentals of Data Engineering

After two enriching years as a Data Engineer, I finally had the chance to dive into Fundamentals of Data Engineering written by the insightful minds of Joe Reis and Matt Housley. Reading this book inspired me to connect my data experience with its theoretical understanding. The book’...

Apache Airflow: Custom Task Triggering for Efficient Data Pipelines

Apache Airflow is an indispensable tool for orchestrating data pipelines, making it a must-know tool for any data engineer in 2023. Like any tool, Airflow has its advantages and disadvantages. While it boasts excellent built-in functionality, there are situations where custom solutions are required ...

Flutter for data engineering and data science!

Flutter is Google’s SDK for crafting beautiful, fast user experiences for mobile, web, and desktop from a single codebase. According to a 2021 developer survey Flutter is the most popular cross-platform framework. Source: https://flutter.dev/ Now, thanks to Flet.d...

Scaling Agglomerative Clustering for Big Data

Agglomerative clustering is one of the best clustering tools in data science, but traditional implementations fail to scale to large datasets. In this article, I will take you through some background on agglomerative clustering, an introduction to reciprocal agglomerative clustering (RAC) based o...

PandasAI: Data Analysis with AI-Powered Simplicity

PandasAI is a super helper. It’s a mix of Pandas, a library that helps with data in Python, and AI, which is artificial intelligence. It has special tricks to make working with data easy. Whether you’re into numbers, words or anything, PandasAI is here to help. So, get ready as we exp...

Automating Rocket League data collection

I have returned with MORE data to improve my Rocket League (RL) gameplay. I took a new data collection and analysis approach, utilizing Python and Power BI, automating most of the manual collection. I will give a general overview of my data wrangling and my thoughts along the way. If you haven’...

Data Modeling in the Modern Data Stack

Data modeling is arguably the most impactful decision for a data team. It determines your architecture and the path that the whole team will follow. While this is not a new topic, the new tools and technology over the last decade have caused many to reconsider what’s best in a modern landscape...

Analyzing Geospatial Data with Python (Part 2 — Hypothesis Test)

In the first post, linked below, we worked with an introduction to Geospatial Data Analysis, where we downloaded the listings from AirBnb for the city of Asheville, in North Carolina (USA) and went through some steps to extract insights from geospatial data. Analyzing Geospatial Data wi...

Polars vs Pandas: Comparing Two Data Processing Libraries in Python.

In the realm of data science and analysis, processing and manipulating data efficiently is pivotal. Python, as one of the premier languages for data science, has an ever-evolving ecosystem of libraries tailored for data wrangling and analysis. Two of the standout libraries in this domain are Pandas a...

Overcoming The Final Hurdle of Data Automation With Fewer Failures

I’m the embodiment of the meme in which a developer spends hours automating a relatively simple task. In other words, while much of the world is increasingly apprehensive of replacing processes with AI, I’m still pro-automation. Image courtesy of starecat.com. And while I’...

5 ChatGPT plugins That Will put you ahead of 99% of Data Scientists

Today, more than ever, plug-ins are popular. They enhance the tasks you do by using ChatGPT’s power. They also help you save time. As I say, after a while, this proficiency will be added to job descriptions. I have already seen ChatGPT in job descriptions on Upwork; Exp...

The Path to Success in Data Science Is About Your Ability to Learn. But What to Learn?

Many great developments in data science have been made in the last decade but despite these achievements, many projects never see the light of day. As data scientists we must not only show strong technical skills but also understand the business context, effectively communicate with stakeholders, an...

How To Prepare Your Data For Visualizations

Want to get started on your next Data Visualization project? Start off by getting friendly with Data Cleaning. Data Cleaning is a vital step in any data pipeline, transforming raw, ‘dirty’ data inputs into those that are more reliable, relevant and concise. Data preparation tools such as...

Harnessing Flight Data: A Comprehensive Guide to Filtering with the AviationStack API

Air travel is a dynamic and rapidly changing industry. With thousands of flights taking off and landing daily, businesses, travelers, and aviation enthusiasts need real-time data to make informed decisions. The AviationStack API provides such data, and in this guide, we’ll delve into the myria...

Saas Company Data Analysis

The SaaS company sells sales and marketing software to other companies (B2B). They have collected transaction data from their customers. They hire a data scientist to analyze the dataset so that they can gain more insight and improve the company’s future performance. The dataset i...

Data Types in python

Hi there, welcome to series three of our Python journey. In our previous lesson, we looked at variables and how to assign values to them. If you missed it, you can take a 3-minute read on it in my previous post. Today, we are going to serve ourselves with data types. Does the topic sound confusing? ...

Data Engineering Project — Online Retail Store

In this project, I will get into the shoes of a data engineer/BI developer working for an online retail service. The service allows users to browse their website and order items online. The managers at the company want us to provide a simple analysis of the data. To complete the assignment, we ga...

Don't Install Python for Data Science. Use Docker Instead!

Docker containers provide a lightweight and efficient way to package and deploy applications, making it easier to move them between different environments, such as development, testing, and production. However, while Docker is widely used for deployment, it has been underutilized by developers for t...

Functional programming in data engineering with Python — part 1

This is an introduction to a series on functional programming in data engineering using Python. Here I lay out some of the fundamental concepts and tools found in functional programming using Python code. What is functional programming? Functional programming is a declarative type of programmi...

Simplify Your Data Preparation With These 4 Lesser-Known Scikit-Learn Classes

Data preparation is famously the least-loved aspect of Data Science. If done right, however, it needn’t be such a headache. While scikit-learn has fallen out of vogue as a modelling library in recent years given the meteoric rise of PyTorch, LightGBM, and XGBoost, it’s still...

Top Data Analysis YouTube Channel In 2023

In the ever-evolving landscape of data analysis, YouTube has emerged as an invaluable resource. This article explores the leading Data Analysis YouTube channel in 2023, offering an essential guide for both beginners and seasoned analysts. Covering a diverse range of topics such as Mathematics, SQ...

Master Data Science with This Comprehensive Cheat Sheet

Data science is a rapidly growing field that combines statistics, mathematics, and computer science to extract insights and knowledge from data. As a data scientist, you need to be proficient in a variety of tools, techniques, and concepts to effectively analyze and visualize data. To help streamlin...

From Data Engineering to Prompt Engineering

Data engineering makes up a large part of the data science process. In CRISP-DM this process stage is called “data preparation”. It comprises tasks such as data ingestion, data transformation and data quality assurance. In our article we solve typical data engineering tasks using ChatGPT...

Exploratory Data Analysis: The Ultimate Workflow

Are you tired of starting from scratch every time you need to explore your data, without a clear roadmap? Look no further! I will guide you through a step-by-step process using Python to uncover valuable insights and trends hidden in your data. Whether you’re a beginner or an experienced da...

Data Engineering Project: Twitter Airflow Data Pipeline

Well, nowadays social media is abuzz with the legendary fight between Meta’s CEO Mark Zuckerberg and X’s owner Elon Musk. It has even escalated to the point of a cage fight between the two tech giants. Well, we all know how that could turn out. Musk Vs Zuckerberg Cage Fight ...

Taipy: a Tool for Building User-Friendly Production-Ready Data Scientists Applications

As a Data Scientist, you might want to create dashboards for data visualization and even implement business applications to assist stakeholders in making actionable decisions. Multiple tools and technologies can be used to perform those tasks, whether open-source or proprietary soft...

Unify Data Layer at Software Projects

Hello, let’s start our article with some questions. … How many models (holding data only, without any logic) are there in your main working project right now? 100 or 1000 or 10000 or even 100000? Is the same quantity of models at other platf...

IIT Madras Data Science Degree: Is It Right for You? Let’s Explore through my experience!

Are you a high school student aspiring for IIT? Are you a college student disappointed not to have cracked IIT? Are you a graduate student striving for GATE and other IITs? Are you a working professional looking to earn a degree while working? Or perhaps, are you a foreigner keen on studying Data Sc...

From Actor to Data Engineer: An Interview with Jeff Vermeire

Welcome to the DET’s “From Anything to Data Engineering” blog series! At DET, we believe that everyone can build a data engineering career regardless of their background. In this blog series, we interview data engineers from various backgrounds and share their unique career journ...

5 Common Data Governance Pain Points for Analysts & Data Scientists

“Those data governance guys sure know how to make life interesting…” It’s time to cut them some slack. Drawing from my experience as an engineer and data scientist at one of Australia’s banking giants for half a decade, I’ve had the privilege of st...

Spring Data Commons implementation for HAPI FHIR (Part 4)

Now we are ready to implement more advanced functionality. In our Repository Factory we override the query lookup strategy using our own implementation for FHIR queries (see the next section). Strategy: public class FHIRQueryLookupStrategy implements QueryLookupStrategy { private f...

Mastering the Game: A Year-Long Journey into Data-Driven Sports Trading with SportGPT

SportGPT has quietly marked an important milestone — one full year of guiding users through the labyrinthine world of sports trading. As we pause to reflect, it becomes essential to evaluate our performance against the backdrop of hard data. Our annual metrics are telling, and they suggest tha...

Missing Data Demystified: The Absolute Primer for Data Scientists

Earlier this year, I started a piece on several data quality issues (or characteristics) that heavily compromise our machine learning models. One of them was, unsurprisingly, Missing Data. I’ve been studying this topic for many years now (I know, right?!) but along some project...

Pandas 2.0: A Game-Changer for Data Scientists?

Due to its extensive functionality and versatility, pandas has secured a place in every data scientist’s heart. From data input/output to data cleaning and transformation, it’s nearly impossible to think about data manipulation without import pandas as pd, right? ...

7 Data Science Specialization Streams Most In-Demand Today

If you are one of my connections on LinkedIn or follow my blog, or have clicked this article based on the title, the chances are that you are pursuing a career in Data Science. You could be a Data Scientist, Data Engineer, Data Analyst, Machine Learning Engineer, or a fresh grad slash aspiring data ...

Improving Data Integration Test Performance with Mock

Have you ever wondered how Data Engineers test code components that involve data endpoints? Testing in data engineering typically starts with unit tests: checking whether a single encapsulated function returns the expected results for a given input. Besides unit tests, integration tests are also use...

Data Entropy — More Data, More Problems?

“It’s like the more money we come across, the more problems we see” Notorious B.I.G Webster’s dictionary defines Entropy in thermodynamics as a measure of the unavailable energy in a closed thermodynamic system that is also usually considered to be a measure of the sy...

Inspecting Data Science Predictions: Individual + Negative Case Analysis

Somewhere around 40 to 43% of the time when I am showing new learners how to use the .predict() methods I get the following question: Where are the predictions? I wish this was a question learners would ask more often. It is an insightful question, especially for folks who are newer ...

[DDIA Reading Notes] Chapter 2 — Data Models and Query Languages

The book ‘Designing Data-Intensive Applications’ has been quite popular in recent years. Why? The reason is that it is relatively ‘easy’ for inexperienced application software engineers to follow and get a glimpse of what the distributed world looks like. This is my second time t...

A Simple Way to Improve Data Science Interviews

In this post I share a story about a mistake I made as an inexperienced Data Science hiring manager, and how it changed the way I conduct technical interviews. I also walk through an example Data Science interview prompt and show how stronger candidates approach the problem differently than weaker c...

Data Science Trends & Salaries in 2023

Data science is one of the coolest fields in recent years. Many people from different backgrounds have transitioned into this field. But, is this trend still ongoing? Today, we’ll handle the data science salaries 2023 dataset and explore trends in data science with data visualization techni...

Stay Consistent Learning Data Engineering — My 5 Strategies

“It is challenging threading through this path. I find it difficult to stay focused and consistent.” “I have noticed you have been consistent in this whole data thing. How do you make it possible? I need some help!” “Learning Data Engineering is a long process and may ...

Why Data Is *Not* the New Oil and Data Marketplaces Have Failed Us

The phrase “data is the new oil” was coined by Clive Humby in 2006 and has been widely parroted since. However, the analogy holds merit in only a few aspects (e.g. the value of both usually increases with refinement) and data’s broader economic impact has been muted outsi...

On a great interview question

Between 2010 and 2019 I interviewed dozens of Software Engineer candidates at Google. Almost always I asked the same interview question. Moreover, this question happened to be on the banned list at Google, because it was publicly available on Glassdoor and other interview websites, but I continued t...

How to Chunk Text Data — A Comparative Analysis

Introduction The ‘Text chunking’ process in Natural Language Processing (NLP) involves the conversion of unstructured text data into meaningful units. This seemingly simple task belies the complexity of the various methods employed to achieve it, each with its strengths and weaknesses...

Words: the new data commodity

Culture as an industry is nothing new. Neither is the notion that data is ‘the new oil’. With the popularity of AI chatbots and large language models, we’re starting a new chapter of it through access to and the commodification of words. Words have become another type of data co...

Faster Data Experimentation With cookiecutter

In the long quest of owning my own data, I have created some services where I experiment with data and different tech stacks. They all have in common a few elements, and while I like to copy/paste since it’s my day-to-day job, I’ve decided to create a cookiecutter for myself. In this art...

How to Build an On-Call Culture in a Data Engineering Team

At any company, one of the best ways to gain and retain customers is to deliver excellent services, meaning the service should be healthy and functional whenever the customers access it. To achieve this, the tech industry introduced on-call duty, which was often associated with doctors in the past. ...

The 3 Biggest Realisations I’ve Had About Data Engineering

If you work in data engineering, I think you can agree that you can easily apply that wisdom to a data engineer’s journey: when you think you’ve grasped one thing or solved one problem, something else turns up to flip it all on its head. Mountains, mountains, mountains — everywh...

Importance of choosing the right Data career!

As the world becomes more data-driven, many students/freshers are looking to enter the field of data and are confused about which career path to choose. The most popular options include Data Engineering, Data Analysis, Data Science, and Software Engineering. While all of these roles are related t...

Simplifying Data Management with Data Version Control (DVC)

Data Version Control, also known as DVC, is a robust tool designed to address the challenges of managing and tracking data in data science and machine learning projects. It integrates seamlessly with existing software engineering practices and is built on top of Git, which is a popular version contr...

5 Common Data Governance Pain Points for Analysts & Data Scientists

Are you an analyst or data scientist at a large organisation? Raise your hand if you’ve ever come across these head-scratchers: Finding data felt like going on a Sherlock expedition. Understanding data lineage was impossibly frustrating. Accessing data bec...

Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data

Over the past few weeks, I have been playing around with several large language models (LLMs) and exploring their potential with all sorts of methods available on the internet, but now it’s time for me to share what I have learned so far! I was super excited to know that Meta released the n...

How I Would Learn Data Science with ChatGPT (If I Could Start Over)

Learning how to learn is one of the most useful skills you can cultivate. When I first started teaching myself programming and data science in 2018, I enrolled into countless online courses. Every time I completed a course and got a certificate, I’d get a momentary feeling of accomplishment...

Removing Unequal Data Distribution Bias from Datasets for Binomial Classification

In the realm of machine learning, achieving accurate and reliable results often hinges on the quality of the dataset being used. One common challenge that arises in binary classification tasks is unequal data distribution bias. When one class significantly outnumbers the other, the model tends to fa...

Missing Data Demystified: The Absolute Primer for Data Scientists

I’ve been studying this topic for many years now (I know, right?!) but along some projects I contribute to in the Data-Centric Community, I realized that many data scientists still haven’t fully grasped the full complexity of the problem, which inspired me to create this comprehensi...

Private GPT: Fine-Tune LLM on Enterprise Data

In the era of big data and advanced artificial intelligence, language models have emerged as formidable tools capable of processing and generating human-like text. Large Language Models like ChatGPT are general-purpose bots capable of having conversations on many topics. However, LLMs can also be fi...

10 Practices I Left Behind to Master the Art of Data Science

Hey there, fellow data enthusiasts! I’m Gabe A, and today, I want to take you on a journey through my data science career, highlighting the ten practices I’ve shed along the way to become the Python and data visualization expert I am today. Over the past decade, I’ve been fortunate...

Understanding Entropy made me a better data scientist

I remember several years ago when I was reshaping my career from finance into data science and being fascinated about how the book Data Science for Business (Provost & Fawcett) introduced the concept of Entropy in their classification examples, so elegantly, so powerful yet s...

A Data Scientist’s Essential Guide to Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the single most important task to conduct at the beginning of every data science project. In essence, it involves thoroughly examining and characterizing your data in order to find its underlying characteristics, possible anomalies, and hidden pat...

HOW BIG DATA AND AI-ASSISTED DECISION MAKING CAN IMPROVE YOUR BUSINESS

In today’s fast-paced business landscape, data-driven decision-making has emerged as a crucial factor for success. Companies that harness the power of big data and employ AI-assisted decision-making processes gain a competitive edge because they understand their customers better. This blog exp...

Mixed Effects Machine Learning with GPBoost for Grouped and Areal Spatial Econometric Data

The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm implemented in the GPBoost library...

Theoretical Deep Dive Into Linear Regression

Most aspiring data science bloggers do it: write an introductory article about linear regression — and it is a natural choice since this is one of the first models we learn when entering the field. While these articles are great for beginners, most do not go deep enough to satisfy senior data ...

A Guide to Real-World Data Collection for Machine Learning

Whether you’re brand new to data science or the Chief Data Scientist at a large organization, you’ve probably played with perfectly crafted data sets to solve toy machine learning problems. Maybe you’ve used K-Means clustering to predict flower species in the Iris data se...

How to Build a Data Science Portfolio Website With ChatGPT

As an entry-level data scientist, it can be challenging to break into the industry because the competition is at an all-time high. This is especially true if you don’t have a degree or any formal qualification highlighting your prowess in the subject. One piece of advice I’d always...

Role of Data Augmentation in Medical Image Analysis

Imagine being in the shoes of a radiologist, meticulously scanning through MRI images of a patient’s brain. You’re searching for any signs of a tumor — a task that demands extreme precision because the stakes are incredibly high. A wrong call could lead to improper treatment, or wo...

Deep Learning and Stock Time Series Data

Time series data is extremely prevalent in modern data science practice. One of the most visible examples of this is stock data, a time series that drives a great deal of modern economic life. In this post, we’re going to attempt to train a univariate deep learning model on time series and see...

Counterfactual Inference Using Time Series Data

Are you ever curious about the true impact of a new marketing campaign, product launch, new government policy, or some other event? Wouldn’t it be nice if you could compare the results of the event occurring versus never occurring at all? Well, with counterfactual inference using time series d...

Applying LLMs to Enterprise Data: Concepts, Concerns, and Hot-Takes

Ask GPT-4 to prove there are infinite prime numbers — while rhyming — and it delivers. But ask it how your team performed vs plan last quarter, and it will fail miserably. This illustrates a fundamental challenge of large language models (“LLMs”): they have a good grasp ...

How to Detect Data Drift with Hypothesis Testing

Data drift is a concern to anyone with a machine learning model serving live predictions. The world changes, and as the consumers’ tastes or demographics shift, the model starts receiving feature values different from what it has seen in training, which may result in unexpected outputs. Detect...

Generative AI - Document Retrieval and Question Answering with LLMs

With Large Language Models (LLMs), we can integrate domain-specific data to answer questions. This is especially useful for data unavailable to the model during its initial training, like a company's internal documentation or knowledge base. The architecture is called Retrieval Augmentat...

Can Data Science Find Bigfoot?

Watching Bigfoot researchers chasing — sometimes quite literally — circumstantial evidence, noises, and half-glimpsed shadows through the forest at night certainly makes for some great TV. I love killing time with a bit of TV, and a show called “Expedition Bigfoot” has kep...

The Role of Product Data Science

Data science organizations help companies leverage data to build better products, improve customer experiences and grow the business. Yet, data scientists can get pulled into many directions. Amongst the endless opportunities that come by their desks and questions that arise, where should they spend...

Build Large Data Pipelines Using Taipy

Building reliable, scalable, efficient and production-ready data pipelines is vital to many modern businesses today to manage and visualize data effectively and make data-driven business decisions. What’s more, teams are often required to integrate the data pipeline with a full-fledged GUI a...

What’s the Difference Between Research and Data Science?

When I first ventured out of academic labs and into the software world, I encountered my first data scientist out in the wild, and to say the least: I was confused. They may as well have been an alien. Who are you? What are you doing here? Who is your leader? I spent the better half of my...

Machine Learning Introduction

A corpus is the collection of all our data X, on which our model bases its calculations to produce some data Y. When our model produces an output Y based on the data X stored in the corpus, this is called Machine Learning, where we generally teach our model to give some output based on a lot of data fe...

10 Useful Python Libraries Every Data Scientist Should Be Using

Python has become an essential tool for data scientists across the world. To help you boost your efficiency doing data science, we’ve put together a list of the 10 most useful Python libraries for data scientists. From speeding up your workflow with distributed computing to helping you p...

How I ended up building a quantitative trading algorithm without a data science degree

For the past 20 months or so, I’ve been building a quantitative trading algorithm in Python by myself. It’s been quite a journey, and here are the main steps and learnings. Introduction I started coding in Python in June 2020 by having to do some web scraping while in an internship ...

Pandas Library Explained

Pandas is a powerful open-source Python library that provides data structures and data analysis tools for working with structured data. It was created by Wes McKinney in 2008 and has since become a fundamental tool for data manipulation and analysis in the Python ecosystem. Pandas is particularly us...

Uncorking Insights: A Data-Driven Journey Through the World of Wine

Wine, a beverage steeped in tradition and culture, has captivated connoisseurs and novices alike for centuries. Beyond the art of sipping and savoring, the world of wine is a complex tapestry of flavors, vineyards, regions, and, of course, ratings. But what if we told you that you could dive deeper ...

How to Become a Data Analyst in One Year

Are you ready to embark on a transformative journey from a non-tech background to becoming a data analyst in just one year? This article will guide you through the practical steps you need to take to make this career transition successfully. Understanding the Data Analyst Role Before diving in...

Predicting Age & Gender using Data Science

In the field of computer vision, deep learning has found a lot of use. The domains that deal with facial data are some of the most important applications of computer vision. Face recognition and detection are common in security-related applications. Age and gender prediction are used extensively ...

My Data Scientist Internship Journey at CodersCave

My journey as a Data Scientist intern at CodersCave was a transformative experience that deepened my knowledge, honed my skills, and ignited my passion for the world of data science. Over the course of my internship, I had the privilege of working on four diverse and challenging projects that broade...

Airflow vs. Mage vs. Kestra

As a passionate advocate of Artificial Intelligence and Machine Learning, I’ve spent over a decade working with Python and delving into the world of data analysis. Along the way, I’ve come across various tools and frameworks that have helped me streamline my workflow. Today, I want to...

How to be a 10X Data Scientist

For more than 6 years, I have worked with many people in my Data Science career. 99% of them were average; however, 1% of them were making a difference. In this article, I will share my experience with those people and give you the tips that I observed, which might help to ...

Follow This Data Validation Process to Improve Your Data Science Accuracy

This article is intended for data scientists who are either beginning or want to improve their current data validation process, serving as a general outline with some examples. First, I want to define data validation here as it can have different meanings for other, similar job roles. For the purpos...

Google launches BigQuery Data Frames

Google just announced BigQuery DataFrames — the feature is now in preview. BigQuery DataFrames is a Python API that you can use to analyze data and perform machine learning tasks in BigQuery[1]. BigQuery DataFrames combines Data Analysis and Data Science capabilities by giving you the follo...

What I Learned in my First Year as a Director of Data Science

Six months ago I noticed that there were not many posts out there about how to manage a data science team (versus data science projects, a topic that has considerably more content). In an attempt to address that gap, I wrote my original post: “What I Learned in my First Six...

Avoid the Steamroller and Ditch the Data Dump: Mastering the Art of Arguing Without Ruining…

This guide is for those who want to win arguments without losing friends, family, or their sanity. If you’re looking for ways to become the most insufferable know-it-all in the room, you’re in the wrong place. Introduction: The Argument Arena Picture this: You’re at a family ...

On Humans And Our Love of Data

Our brains love patterns. They help us make sense of the world, define our realities and understand where we are. Mathematics is a language, eloquent and as valuable, and informative to us as words, written and spoken. We use math and language to tell stories and stories become narratives which beco...

The Modern Data Stack Through ‘The Gervais Principle’

Go and Google the term “Modern Data Stack” and search through images. What do you see? It’s one big slew of architecture diagram after architecture diagram, with data flowing throughout various systems from the left to the right in most, much sound and fury signifyin...

How to Sketch your Data Science Ideas With Excalidraw

Motivation If you want your manager or colleague to understand your ideas for a project, don’t show them only words or a chunk of code. Use graphs or diagrams. Imagine you want to explain to your manager the process of training a cat classifier; it would be easier for them to understan...

Data Humanism, the Revolution will be Visualized.

Data is now recognized as one of the founding pillars of our economy, and the notion that the world grows exponentially richer in data every day is already yesterday’s news. Big Data doesn’t belong to a distant dystopian future; it’s a commodity and an intrinsic and iconic featu...

week 32 | MA Data Visualisation

I have been revisiting Laurie Anderson’s All the Things I Lost in the Flood and John Cage's A Mycological Foray. I have found both these books to be inspiring reference points and help me to step out of my project, and consider new layouts, and the ‘experience of my ou...

Industrial Metaverse: A software and data perspective

A new buzzword percolating in many technical communities is “Metaverse.” The concepts around an emerging Metaverse are some of the most exciting developments in the convergence of business and technology. In this article, I share some of my thoughts on building Metaverse systems and their...

Who’s Really Using VR these days? Six Data-Driven insights into today’s VR User Demographic

The world of virtual reality (VR) is booming: more than 13% of US households own a VR device, spending on average 30–45 minutes in it 2–4 times per week. Unsurprisingly, more and more brands are starting to look into this new medium, but often, professionals are unsure about who act...

Unlocking Data Access: Harnessing Triggers in the Absence of API Endpoints

Overview Have you ever faced a scenario wherein you’ve tried to extract a crucial data point from a transactional system (such as an e-commerce system) using its API, only to discover that the necessary information was not accessible through the provided endpoints? If so, read on to discove...

Serverless Scheduling on AWS: A Data-Driven Approach

Last week, I had two discussions where I was asked if it’s possible to create schedules based on data entered into a system and clean them afterwards. Additionally, I was asked about the right technology for this task: should it be Infrastructure as Code (IaC) or an application cron job? After...

Data Engineering with Reddit, Airflow, Celery, Postgres, S3, AWS Glue, Athena, Redshift

Building a data pipeline can be a complex task, especially when integrating multiple services and platforms. In this article, we’ll walk through the process of creating a data pipeline that fetches data from Reddit, uses Apache Airflow for orchestration, stores the data in Amazon S3, processes...

Reducing Data Platform Cost by $2M

Welcome to the Razorpay technology blog! In this blog, we are primarily focusing on cost savings by the Data platform team. We have some exciting news to share with you! Through our efforts to reduce costs, we successfully managed to cut down our platform expenses by approximately $2M per year. ...

Visualizing Log Data with Grafana, Loki, and Promtail

Congratulations on successfully setting up Grafana on your local environment! Now, it’s time to create a dashboard in Grafana with Loki and Promtail integration. In this exciting task, we’ll explore how Grafana enables you to monitor and analyze various components of your se...

T1A Develops Cloud File Transfer Framework to Democratize Data Integration

Introduction Imagine a Data Platform that’s operating smoothly, with data pipelines functioning flawlessly day in and day out. Then, all of a sudden, new demands emerge that involve integrating various files into the platform, for example: A business user urgently approaches you, aski...

Apache Spark Data Transformation: Flattening Structs & Exploding Arrays

In one of my projects involving data engineering, I was faced with the task of handling JSON files. These files contained not just primitive data types, but also reference data types (arrays and structs). The project required me to read these JSON files, flatten their structure, and then save the dat...

End-to-end Azure data engineering project — Part 1: Project Requirement, Solution Architecture and ADF reading data from API

This series comprises 4 articles showcasing comprehensive data engineering practice on the Azure platform. It encompasses the utilization of Azure Data Lake, Databricks, Azure Data Factory, Python, Power BI, and Spark technology. In this initial part (part 1), we will acquaint ourselves with the...

Olympic Data Engineering @Azure

The Olympic Games are one of the world’s most celebrated events, bringing together athletes from diverse backgrounds to compete on a global stage. Inspired by Darshil Parmar’s insightful YouTube video on Olympic Data Engineering, this article explores the fascinating world of data engine...

Reducing Data Platform Cost by $2M

Welcome to the Razorpay technology blog! In this blog, we are primarily focusing on cost savings by the Data platform team. We have some exciting news to share with you! Through our efforts to reduce costs, we successfully managed to cut down our platform expenses by approximately $2M per year. ...

Data Lakehouses with Databricks — A Modern Approach to Data Management

In this article, we’ll look into the story behind data lakehouse architectures. We’ll also discuss the benefits of data lakehouses and how businesses can use them to improve their data analytics capabilities through the Databricks Lakehouse Platform. A look into Data Lakehouses Dat...

Policy-Based Access Control (PBAC): What It Is and Why You Need It in Your Modern Data Lakehouse?

In a recent project, I worked on PBAC implementation in a Databricks environment. It was a new challenge, but I learned a lot and am excited to share my knowledge with others. Before we go deep into PBAC, first let’s understand what are the traditional methods for access control in the data...

Building a Cloud Data Platform

Building a Cloud data platform is not simple; there are several important design considerations to keep in mind. I recently got a unique opportunity to build a greenfield Cloud Data Platform from the ground up for my company. For background — my company is the carved out Tea business entity of Unileve...

Unlocking Data Insights with Databricks

In the fast-evolving world of data analytics, one tool has emerged, viewed by some as a game-changer: Databricks. This cloud-based big data analytics platform is designed to streamline the complexities of data exploration, preparation and analysis. Born from the creators of Apache Spark, Databricks ...

Data traceability with machine and transaction data using Azure Databricks

This blog explains how to connect data from a transaction (OLTP) database and machine data obtained via Spark streaming using Azure Databricks. I use the medallion architecture to join data from the OLTP database and machine database. I will focus on the Gold layer of the architecture...

How to Read and Write Streaming Data using Pyspark

Spark is increasingly integrated with cloud data platforms in the modern data world. Manipulating data with Spark has become crucial for any data persona, such as data engineers, data scientists, and data analysts. Last time, we covered a trivial exercise in big data on reading and writing static dat...

The perfect data pipeline doesn’t exist: Databricks

This is a multi-part article series where I dive into the ideal stack you’d use for a data engineering pipeline given constraints around what software providers to use. I aim to provide some indications of cost, ease of use, and functional limitations / cool features. Original article:...

End-to-end Azure data engineering project — Part 2: Using DataBricks to ingest and transform data

This is a series of 4 articles demonstrating the end-to-end data engineering process on the Azure platform, using Azure Data Lake, DataBricks, Azure Data Factory, Python, Power BI and Spark technology. In this part 2, we will discuss how to use Databricks to ingest and transform the JSON files that ...

Data LakeHouse Part III — Data Ingestion

The first step in building your Data LakeHouse is bringing your data into it. Word of Caution: If you start this activity before careful planning and modeling, you may end up with a Data Swamp, as happened with many Data Lakes. Plan on bringing all data patterns to your LakeHouse: R...

Data Storage in PySpark: save vs saveAsTable

When it comes to saving DataFrames in PySpark, the choice between ‘save’ and ‘saveAsTable’ is more significant than it might initially appear. Although they perform similar tasks—saving your DataFrame to a location—these methods are quite different. This article d...

Liquid Clustering: An Innovative Approach to Data Layout in Delta Lake

Introduction Announced at the 2023 Data + AI Summit [1], Delta Lake liquid clustering introduces an innovative optimization technique aimed at streamlining data layout in Delta Lake tables. Its primary goal is to enhance the efficiency of read and write operations while minimizing the need for tu...

Top Data Certifications for a Successful 2024

In the fast-paced realm of data engineering, staying ahead of the curve with cutting-edge certifications is your passport to unlocking a world of exhilarating career prospects. As we brace for the challenges and opportunities of 2024, the demand for skilled data engineers continues to soar, presenti...

Start-up Data Engineering bible: Ingestion (Part 2)

About me Hello, I’m Hugo Lu, a Data Engineer who has also worked in Finance and is now CEO at Orchestra. Orchestra is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently. I write about what good looks like in Data. Introduction...

Azure Data Lake

TL;DR: Clone this git project, set params and run 0_script.sh to deploy 1 ADLS Gen2 hub and N Databricks spokes. A data lake is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build dat...

My rants on Data — Data on Cloud Part 2

Disclaimer: As the title mentions, this is a rant!! Content might appear to be unorganised, random or even self-contradictory, but are my opinions. You’ve been warned. I mean no offence to all the solutions & tech companies mentioned in this article (not completely true). I have utmost ...

Passing the Databricks Professional Data Engineer Exam

How to prepare for the Databricks Data Engineer Professional Exam. I recently passed the Databricks Certified Data Engineer Professional exam, and I highly recommend that any serious Spark developer consider taking it. I won’t sugarcoat it. The ...

Docker For the Modern Data Scientists: 6 Concepts You Can’t Ignore in 2023

It touches on one of the most painful problems not just in data science and ML but in all of programming — sharing applications/scripts and making the darn things work on others’ machines as well. While Microsoft, Apple, and Linus Torvalds meant well when they released different opera...

How to pass data between ViewModels?

In the dynamic realm of Android app development, crafting robust and responsive applications often involves managing complex data flows between different components. One of the key challenges developers face is passing data efficiently between ViewModels. As applications grow in complexity, the...

Crafting a Swift Package Plugin for App Data Protection

Protecting data and intellectual property is a common but challenging task in software development. No developer wishes their individual or team efforts to be unlawfully exploited by intruders or competitors. While iOS apps provide significant security — provided by Apple encryption and runtim...

8 Things Most Data Science Programs Don’t Teach (But You Should Know) — Part 1

Maybe you’ve heard about or experienced the 80/20 rule of data science before: Only 20% of data scientists’ time is spent on analysis. The majority of time — 80% — is spent simply preparing data. I think in my case it might actually be 90/10. That is just the realit...

Data Persistence in Swift: Beyond UserDefaults

Data persistence is a crucial aspect of many iOS and macOS applications. While UserDefaults is a convenient way to store small pieces of data, more complex apps often require more robust and flexible data persistence solutions. In this article, we'll explore various techniques for data...

Analysis of Amsterdam Airbnb Data

Hello friends and fellow data science enthusiasts, Colorful streets, bicycles, amazing nights and jaw-dropping architecture. What does it remind you of? Yes, it’s Amsterdam. I recently got a chance to analyse Airbnb data for Amsterdam and it was a great learning experience for me. I would love to share ...

A quantitative analysis of Airbnb data for Bangkok

Nowadays, when looking for accommodation, Airbnb is certainly one of the services that comes to mind. It works as an online marketplace that allows people to easily rent their homes, or rooms, to guests looking for a short-term rental. Born in San Francisco, in 2007, Airbnb now has over...

What Dublin Bikes data can tell us about the city and its people

My favorite way to practice and build my data science skills is to hunt for publicly available datasets online and analyze them to uncover interesting insights into human behavior from what can initially seem like administrative, even boring, records. Public authorities are increasingly sha...

Airbnb Data Analysis — Dublin

Airbnb is already considered to be the largest hotel company today. Oh, and the detail is that it does not own a single hotel! By connecting people who want to travel (and stay) with hosts who want to rent out their properties in a practical way, Airbnb provides an innovative platform...

Spatial data API with GraphQL, PHP and MySQL

Some years ago I bought an internet domain offering a PHP based server backed by a MySQL database. I kept it for years without using it because I had neither the desire nor the time to write or develop. On the other hand, ever since I graduated in GIS I have always had the desire to take up and impr...

Analyzing Airbnb Data — Mexico City

Taking advantage of the FIFA World Cup atmosphere, I will analyze data from one of the cities that will host the competition in 2026, Mexico City. Mexico City, besides being the capital of Mexico, is one of the world’s great tourist centers, registering 31.9 million international tourists i...

How much money you can make as a Data Analyst?

When considering a career change or choosing a specific field, one of the first questions that come to mind is earning potential. As humans, it is natural to want to know how much money we can make in a particular field. If I talk about my personal experience I went through the same situation. As...

Exploring the Impact of Weather on Paris Bike Count Data

The “opinion” rating on the site represents a subjective assessment of weather conditions, ranging from very unfavorable to very favorable. While we don’t have detailed insights into how this rating is determined, we believe it reflects a combination of subjective perceptions and a...

Data Analysis of Rio de Janeiro’s Violence Rate

Rio de Janeiro, also known as the Wonderful City, is the most famous place in Brazil. It is also a favourite among tourists worldwide. Rio is well known for its Carnaval, beaches, soccer, food, Christ the Redeemer statue, music and friendly people. Copacabana Beach...

How I Got Some GTFS Data On a Google Map

Next weekend (Friday, November 22nd, 2019), if all goes well, a free for all public minibus service will start operating in the Gush Dan conurbation. The service will include the cities Tel Aviv — Jaffa, Givatayim, Ramat Hasharon, Holon and Kiryat Ono with six different circular lines in a fre...

Towards a Data Quality Score in open data (part 2)

I find iterative product development is as key in data as in other fields, given that many data science tasks, e.g. model tuning or feature engineering, can be an endless pursuit: that extra 1% in accuracy is not always worth the effort and defining product increments helps establish the “good enough”...

White Star Capital is looking for a Data Analyst in Toronto

Step into the dynamic world of data at White Star Capital! We are seeking a highly motivated Data Analyst to join our Data and Technology team, based in North America. As a Data Analyst you will play an essential role in the development of White Star Capital’s technology stack and long-term te...

On the Road to a Feminist Data Governance

It is within this context that FAIR SHARE of Women Leaders was founded in 2019, with the FAIR SHARE Monitor as its flagship project. The Monitor makes the data on women in leadership transparent,¹ holds international social impact organisations accountable for their gender g...

Easy Spatial Insights: Interactive Analysis with County Property Data

This article expands on our previous one, Unlocking Spatial Insights: Loading County Parcel Data from Esri Feature Services, where we explored accessing county property data without an ArcGIS License via the ArcGIS REST Framework. Now, let’s dive deeper into the process. We’ll illustra...

How I Use Data Analytics in Real Estate Investing

Our business of over 16 years is delivering reliable passive income properties. A reliable tenant must continuously occupy the property to generate a reliable income. A reliable tenant always pays the rent on schedule, stays many years, and takes good care of the property. Since you will own the ...

Case Study: Housing Price Prediction on Zillow’s Data

Welcome to the final article in our three-part series on leveraging Zillow’s housing data for various analytical and predictive tasks. In the first article, we understood the importance and applications of web scraping along with some tools, such as Bright Data’s scraping browser. T...

The Transformative Power of Data Analysis in the Real Estate Sector

1. Market Trend Analysis: One of the primary applications of data analysis in real estate is market trend analysis. By harnessing the power of historical and current market data, industry stakeholders can identify patterns, predict market trends, and make data-driven decisions. This insight enabl...

Respectful Collection of Demographic Data

And reward them for their contributions! Women — and black women in particular — are frequently asked to volunteer their time and energy into helping others understand issues that affect them. For Product Managers, incorporate feedback requests into your user research, heed ...

Improving data and policies to support LGBTQ+ people in STEM

Shane Coffield, Kolin Clark, Anna Dye, Colbie Chinowsky, Briana Niblick, Marco Reggiani, Bryce Hughes, Alfredo Carpineti, Randall Hughes, Lauren Crawford, LeManuel Bitsóí As growing attention is placed on demographic data collection and diversity, equity, & inclusion (DEI) in ST...

Data Literacy… and the Law?

When data is added to legal and business knowledge, the result is an even more powerful analysis and decision-making framework that provides better results for both the business and its clients. But to effectively use data, a legal professional must be data literate. You may have heard of the phr...

Data Labeling: A Lighthouse on the Rocky Shores of the AI-Driven Legal-Tech Industry

This doesn’t come as a surprise, as there’s currently an influx of AI-oriented tech startups in their thousands from all directions. The legal sphere in particular has benefited tremendously from these emerging technologies, with the US, UK, and EU leading the way. Within the legal in...

Why AI for Lawyers is More Than Just a Trend: A Data-Driven Approach

Artificial Intelligence (AI) has taken the world by storm. Its impact is felt across numerous industries, and the legal sector is no exception. The buzz around AI in law has grown exponentially in recent years. But is it just a passing trend? Or is it here to stay? It’s crucial to recognize...

What are the data privacy risks that come with Autonomous Vehicles? (3/3)

The third main issue that arises with autonomous vehicles concerns data privacy. AVs can generate up to 40 terabytes (TB) of data in an hour through the use of cameras and other mechanisms to drive independently. This data is not exclusive to one vehicle alone, instead being link...

Data Divide Part 4: The Societal Impact of the Data Divide

Data has become more than just a valuable resource; it’s a cornerstone of modern society. However, the stark reality is that not all segments of society have equal access to, or benefit from, this data-empowered landscape. This disparity is often referred to as the Data Divide, and its re...

Top 10 Maritime Data Providers

As a data scientist in the supply chain industry for many years, I have had the opportunity to use a variety of maritime data providers to help optimize operations and reduce risks in shipping. In this article, I want to share my top 10 picks for the best maritime data providers that I have found to...

The Art of Mastering Inventory with Smart Data

The Scenario: A Common Challenge Imagine a product, SKU X12345, in a typical business setting. It’s a popular item, but its demand fluctuates, and the supply chain has its quirks. How do we ensure that we have just the right amount of X12345 in stock? The Data-Driven Approach The answ...

What is Master Data Management?

In my experience working in Logistic Operations, I have faced issues due to the inconsistency of master data impacting warehouse operations, transportation or business analytics. Objective Explain the importance of master data by introducing a list of key ...

Synchronizing LiDAR and Camera Data for Offline Processing Using ROS

Why is it important to synchronize this data? There are different reasons. One is that different sensors collect data at different rates. For instance, my LiDAR collects data at 10 Hz (10 samples per second) while my camera takes images at 3 Hz. For every image, we have about 3 point clouds. Another...

Attaching sensors and visualizing sensor data in Carla-viz

This blog is a continuation of my previous one, which covered how to spawn a vehicle in a desired location in Carla. My aim is simply to help those who are trying to work on self-driving cars, as there aren’t enough resources available online. So in this blog, we will attach dif...

How data visualisation can distort our perception of reality

What you see is all there is. Whether you’re looking at a cover of a book, observing a couple arguing in the street, or watching a TV news report, we’re constantly making judgements based on the limited information we have access to. We’re all entitled to perceive the world...

International ghost pineapple data sipping

And the associated rationale (citizen consent permitting) public now please… anything else is illegitimate. This is going to destabilize the entire system and its damaging effects are exacerbated every day we get further into the election cycle… hurry up and look in the mirror. ...

Data Disaggregation for the Marginalized

I’m on the board of a theater company called Kyoung’s Pacific Beat and at the end of 2023, Other No More started to transform from the name of a political campaign against the racist way AIDS data was being collected to a theatrical piece. When people were dying during the AIDS...

Unearthing the Future: AI as the New Tool for Data Preservation and Access in Archaeology

Traditionally, archaeological exploration has been a labor-intensive process, requiring meticulous excavation and analysis. However, AI is dramatically accelerating this process. Machine learning algorithms can now sift through vast amounts of data, identifying patterns and making connections that w...

Jumping from a chemistry PhD to data science at Faire

There’s no typical example of a data scientist these days. With the field still relatively young, people from all kinds of backgrounds and experiences — including an academic one, like mine — are finding their way to a data science career. At Faire, we recognize the value o...

How I Built a Generative AI Model that can Generate Novel Small Molecules for Drug Discovery! — Part 1: Data Preprocessing

Recently, MD Anderson, the world’s best cancer treatment facility, announced that it was partnering with Generate:Biomedicine, an Artificial Intelligence (AI) drug development company, to take advantage of the rise of generative AI to produce “novel protein therapeutics.” They aim ...

How I Got From Chemistry to Data Science

People are often surprised to learn that my background is chemistry. Data science job adverts almost always specify physics, mathematics, or computer science. So how does a chemist become a data scientist? Here’s what worked for me. I started my undergraduate degree in chemistry at Imperial...

Python for Bioinformatics: Analyzing Genetic Data with Code

The Genetic Symphony: Our journey begins with the genetic symphony, where the A’s, T’s, C’s, and G’s compose the verses of life. As the symphony plays out, we introduce Python as the conductor, translating the intricate language of DNA into a readable score. Witness how ...

Building a Genomics Data Analysis Platform: A Detailed Guide

The field of genomics has experienced a revolution thanks to advancements in sequencing technologies and computational biology. Genomics data analysis platforms play a crucial role in interpreting the vast amounts of data generated by sequencing projects. This guide will provide a comprehensive over...

Precariously Balanced Boulder Data

At some date in an indefinite future, a San Andreas fault — extending from Los Angeles to San Francisco — will generate a great earthquake [EQ]. But the good news, “according to new work presented this week at a meeting of the American Geophysical Union, the ground there ...

Drilling into the Future: Is your drilling data ready for AI?

ChatGPT, Midjourney, Bard, and countless other AI tools that are now emerging are the talk of the technology world in every corner of the globe. AI has become a transformative force, reshaping industries from healthcare to finance. As companies utilize the power of AI to analyze vast datasets, pr...

Seismic Data Processing: Signal Vs Noise

Traditionally, noise in seismic data processing is regarded as unwanted records in seismic data (wavelets) acquisition. The aim of seismic data processing is to increase the signal to noise (SN) ratio. So, what is the signal? The signal is a record of a seismic wavefront from source to receiver w...

Why You Don’t Control Your Health Data

During my residency training our hospital system was hacked. Computers were shut down and a ransom was demanded. Everything reverted back to paper. That meant lab orders, notes, vitals, and everything in between. Medical techs took on new roles as sprinters. The chaos made it clear that we had becom...

14 Probability problems for acing Data Science interviews

Questions on probability are very common in any Data Science interview. The questions might be challenging (and tricky) but are easily tractable if you have some practice and know basic formulas and concepts. In this blog, I share some practice questions (with solutions) on different concepts in pro...

15 Important Probability Concepts to Review Before Data Science Interview [Part 2]

Aspiring data scientists entering the realm of interviews often find themselves navigating a landscape heavily influenced by probability theory. Probability is a fundamental branch of mathematics that forms the backbone of statistical reasoning and data analysis. Proficiency in probability concep...

Mastering the Basics Part 13: Understanding of Statistics and Probability for Data and AI

Probability provides the mathematical foundation to model uncertainty, make predictions, evaluate models, and make data-driven decisions in the dynamic and often uncertain environments where AI and data science are applied. Concepts from probability are applied in various contexts, including data an...

Sequential A/B Testing Keeps the World Streaming Netflix Part 1: Continuous Data

Can you spot any difference between the two data streams below? Each observation is the time interval between a Netflix member hitting the play button and playback commencing, i.e., play-delay. These observations are from a particular type of A/B test that Netflix runs called a software canary ...

How I Learned to Create Stunning Data Visualization in R

Data visualization is the art and science of presenting data visually, making it easy to understand to readers, and also helping you to communicate your findings. But how do you create effective data visualizations in R? Many packages are available in R for data visualization, but one of the m...

Statistics — Data Classification

Hello all, this blog is for all data science aspirants who want a better understanding of statistics, and for those who feel lost in statistics and probability, unsure where to start or what to study next as they begin studying for machine learning. This blog is for ...

The Two Metrics That Reveal True Data Dispersion Beyond Standard Deviation

We’ve all heard the saying, “Variety is the spice of life,” and in data, that variety or diversity often takes the form of dispersion. Data dispersion makes data fascinating by highlighting patterns and insights we wouldn’t have found otherwise. Typically, we use the follo...

Deciphering Data Diversity: A Comprehensive Guide to Measures of Dispersion with Python

Article Outline: I. Introduction - Brief overview of the concept of dispersion in statistics and its importance in data analysis. - Introduction to the key measures of dispersion: range, variance, standard deviation, and interquartile range. II. Understanding Measures of Dispersion - Defini...

Transform Your Data Game: Mastering Scales and Types in Statistics Like Never Before!

When I gather data, I always remind myself that the way a variable is measured can greatly influence my analysis. The level of measurement refers to the specific way values assigned to each variable are structured and understood. This concept is foundational because it dictates how I can mathematica...

Unlock the Secrets of Maximum Likelihood Estimation: Your Ultimate Guide to Data Mastery!

Maximum Likelihood Estimation (MLE) represents a cornerstone technique in statistical analysis, allowing you to delve deep into the heart of data interpretation and prediction. By embracing MLE, you are equipped to navigate through the complexities of statistical models, unlocking the potential hidd...

Pharmacist to Data Analyst: The journey!

It is easy to say that my journey to becoming a data analyst started with Coursera back in December of 2022; however, this would be a slight misrepresentation of the true picture. My journey actually started back in my pre-formative primary and secondary school days where my fascination with solving...

Which is better Clinical data management or Pharmacovigilance?

Choosing the right career path in the pharmaceutical industry can be a daunting yet exciting task, especially when considering specialized fields like Clinical Data Management (CDM) and Pharmacovigilance. Both play crucial roles in ensuring the safety and efficacy of pharmaceutical products, yet the...

7 Days of Data Visualization

I’m creating a data scientist portfolio in 30 days. The first week is done, and it’s time to showcase my progress working with data visualization. The project began slightly differently, visualizing a deep neural network, which is definitely a type of data visualization but not the fi...

Museums Are Going Digital—and Borrowing From Data Viz in the Process

A few months ago I was averaging three museum visits a week, and now I can average three virtual visits in an hour, provided I’m graced with a stable internet connection. While jumping from collection to collection, I’ve held a rather critical eye to the digital museum realm. The trick, h...

Bring Colors to your Data Frames

In this article, you’ll learn how to add colours to a pandas dataframe by using pandas styling and options/settings. The Pandas documentation is rather extensive, but if you’re searching for a somewhat more friendly introduction, I believe you’ve come to the right place. Pandas ...