What is Data Version Control (DVC)?
Any production-level system requires some kind of versioning: a single source of truth for its current state.
Any resource that is continuously updated, especially by multiple users at once, needs an audit trail to keep track of all changes.
In software engineering, the solution to this is Git.
If you have written code in your life, then you are probably familiar with the beauty that is Git.
Git allows us to commit changes, create branches from a source, and merge those branches back into the original, to name a few of its capabilities.
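The workflow above can be sketched in a throwaway repository; the repo name, identity, and commit messages below are placeholders:

```shell
# A minimal sketch of the commit / branch / merge cycle described above.
git init demo && cd demo
git config user.email "you@example.com"    # placeholder identity so commits work
git config user.name "Demo User"
git commit --allow-empty -m "initial commit"
base=$(git branch --show-current)          # default branch name varies (main/master)
git switch -c experiment                   # create a branch from the source
git commit --allow-empty -m "try an idea"  # commit a change on the branch
git switch "$base"                         # return to the original branch
git merge experiment                       # merge the branch back
```

After the merge, the original branch contains the experiment's commits, and `git log` shows the full history of who changed what and when.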
DVC is the same paradigm, but for datasets. See, live data systems are continuously ingesting newer data points while different users carry out different experiments on the same datasets.
This leads to multiple versions of the same dataset, which is definitely not a single source of truth.
Additionally, in a machine learning environment, we would also have several versions of the same ‘model’ trained on different versions of the same dataset (for instance, model re-training to include newer data points).
If not properly audited and versioned, this would create a tangled web of datasets and experiments. We definitely do not want that!
DVC, therefore, is a system for versioning datasets by registering each change made to them. There are multiple DVC solutions, both free and paid.