Introduction to Data Version Control
<h1>What is Data Version Control (DVC)?</h1>
<p>Any production-level system requires some kind of versioning.</p>
<blockquote>
<p><strong>A single source of current truth.</strong></p>
</blockquote>
<p>Any resource that is continuously updated, especially by multiple users at the same time, requires some kind of audit trail to keep track of all changes.</p>
<p>In software engineering, the solution to this is <a href="https://git-scm.com/" rel="noopener ugc nofollow" target="_blank">Git</a>.</p>
<p>If you have written code in your life, then you are probably familiar with the beauty that is Git.</p>
<p>Git allows us to commit changes, create different branches from a source, and merge those branches back into the original, to name a few features.</p>
<p>DVC applies the same paradigm to datasets. Live data systems continuously ingest new data points while different users carry out different experiments on the same datasets.</p>
<p>This leads to multiple versions of the same dataset, which is definitely not a single source of truth.</p>
<p>Additionally, in a machine learning environment, we would also have several versions of the same ‘model’ trained on different versions of the same dataset (for instance, model re-training to include newer data points).</p>
<p>If not properly audited and versioned, this would create a tangled web of datasets and experiments. We definitely do not want that!</p>
<p>DVC is, therefore, a system that tracks our datasets by registering every change made to a particular dataset. There are multiple DVC solutions, both free and paid.</p>
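<p>To make "registering changes on a dataset" concrete, here is a minimal sketch in Python. It is not how any particular DVC product is implemented; the function names <code>file_digest</code> and <code>register_version</code> and the JSON registry file are hypothetical, chosen only for illustration. The idea is to fingerprint the dataset's contents with a hash and record a new version entry only when that fingerprint changes.</p>

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def register_version(data_path, registry_path, note):
    """Record a new version entry if the dataset's contents changed.

    Returns True when a new version was recorded, False when the
    contents match the latest registered version.
    (Hypothetical helper: the registry is just a JSON list on disk.)
    """
    registry_path = Path(registry_path)
    history = json.loads(registry_path.read_text()) if registry_path.exists() else []
    digest = file_digest(data_path)
    if history and history[-1]["sha256"] == digest:
        return False  # nothing changed since the last registered version
    history.append({"version": len(history) + 1, "sha256": digest, "note": note})
    registry_path.write_text(json.dumps(history, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"
        registry = Path(tmp) / "registry.json"

        data.write_text("id,value\n1,10\n")
        print(register_version(data, registry, "initial load"))    # True

        print(register_version(data, registry, "no change"))       # False

        data.write_text("id,value\n1,10\n2,20\n")
        print(register_version(data, registry, "new data point"))  # True
```

<p>The registry then acts as a simple audit trail: each entry says which contents existed at which version and why they changed. Real DVC solutions add much more on top of this, such as storing the data itself and linking versions to experiments.</p>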
<p><a href="https://towardsdatascience.com/introduction-to-data-version-control-59fabf447a60">Visit Now</a></p>