Introduction to Data Version Control

<h1>What is Data Version Control (DVC)?</h1>
<p>Any production-level system requires some kind of versioning.</p>
<blockquote> <p><strong>A single source of current truth.</strong></p> </blockquote>
<p>Any resource that is continuously updated, especially by multiple users at once, needs an audit trail to keep track of all changes.</p>
<p>In software engineering, the solution to this is&nbsp;<a href="https://git-scm.com/" rel="noopener ugc nofollow" target="_blank">Git</a>.</p>
<p>If you have ever written code, you are probably familiar with the beauty that is Git.</p>
<p>Git allows us to commit changes, create branches from a source, and merge those branches back into the original, to name a few features.</p>
<p>DVC applies the same paradigm to datasets. Live data systems continuously ingest new data points while different users run different experiments on the same datasets.</p>
<p>This leads to multiple versions of the same dataset, which is definitely not a single source of truth.</p>
<p>Additionally, in a machine learning environment, we would also have several versions of the same &lsquo;model&rsquo;, each trained on a different version of the same dataset (for instance, when a model is re-trained to include newer data points).</p>
<p>If these are not properly audited and versioned, the result is a tangled web of datasets and experiments. We definitely do not want that!</p>
<p>DVC is, therefore, a system for tracking our datasets by registering every change made to a particular dataset. There are multiple DVC solutions, both free and paid.</p>
<p><a href="https://towardsdatascience.com/introduction-to-data-version-control-59fabf447a60">Visit Now</a></p>
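<p>To make the idea of registering changes on a dataset a little more concrete, here is a minimal Python sketch. It assumes a dataset version can be identified by a SHA-256 hash of its contents. The <code>DatasetRegistry</code> class and its <code>register</code> method are hypothetical names made up for this illustration, not the API of any real DVC solution.</p>

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DatasetRegistry:
    """Hypothetical in-memory audit trail for one dataset (illustration only)."""

    versions: list = field(default_factory=list)

    def register(self, data: bytes, message: str) -> str:
        """Record a new version if the content changed; return its content hash."""
        digest = hashlib.sha256(data).hexdigest()
        # Identical content to the latest version means nothing to register.
        if self.versions and self.versions[-1]["hash"] == digest:
            return digest
        self.versions.append({"hash": digest, "message": message})
        return digest


registry = DatasetRegistry()
v1 = registry.register(b"id,value\n1,10\n", "initial snapshot")
v2 = registry.register(b"id,value\n1,10\n2,20\n", "ingest new data point")
v2_again = registry.register(b"id,value\n1,10\n2,20\n", "no change")

# Two distinct versions are recorded; the unchanged re-registration is skipped.
print(len(registry.versions))
```

<p>Each entry in <code>registry.versions</code> plays the role of a commit: a hash that pins down exactly which data was used, plus a note on why it changed. Production-grade solutions do much more (storage, branching, sharing), so treat this only as a sketch of the audit-trail idea.</p>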