Boosting Password Security with Natural Language Understanding: Building a Simple Password Strength Checker with BERT Transformer

In an era where cyber threats are more pervasive than ever, securing online accounts is of paramount importance. Passwords are often the first line of defense against unauthorized access, which makes their strength a critical factor in safeguarding our digital lives. <blockquote> In this article, I show how to enhance password security by harnessing the BERT (Bidirectional Encoder Representations from Transformers) transformer model, one of the most widely used publicly available models in Natural Language Understanding. </blockquote> The first step is to take the <a href="https://github.com/danielmiessler/SecLists/tree/master/Passwords" rel="noopener ugc nofollow" target="_blank">publicly available dataset</a> of about 1 million of the most common passwords, also <a href="https://www.kaggle.com/datasets/joebeachcapital/top-10-million-passwords" rel="noopener ugc nofollow" target="_blank">available on Kaggle</a>, and to mix it with an equally sized sample of 1 million randomly generated complex passwords, each 6 to 10 characters long and drawn from lowercase and uppercase letters, digits, and common special characters. Then, I fine-tune one of the pre-trained Hugging Face models on this data: <a href="https://huggingface.co/bert-base-cased" rel="noopener ugc nofollow" target="_blank">Google’s BERT case-sensitive model</a>, which has about 108 million trainable parameters. The final code for data selection and training is available <a href="https://www.kaggle.com/code/dima806/passwords-strength-checker-bert" rel="noopener ugc nofollow" target="_blank">as a Kaggle notebook</a>. 
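The data-mixing step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the special-character set, the label convention (0 for common, 1 for random), and the helper names `random_password` and `build_dataset` are all assumptions made for the example, and a four-item list stands in for the real password file.

```python
import random
import string

# Assumption: the article only says "common special characters"; this set is illustrative.
SPECIAL_CHARS = "!@#$%^&*"
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + SPECIAL_CHARS


def random_password(rng, min_len=6, max_len=10):
    """Draw one random password whose length is uniform in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)  # randint includes both endpoints
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def build_dataset(common_passwords, rng):
    """Pair each common password (label 0) with one random password (label 1), then shuffle."""
    rows = [(pw, 0) for pw in common_passwords]
    rows += [(random_password(rng), 1) for _ in common_passwords]
    rng.shuffle(rows)
    return rows


rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample_common = ["123456", "password", "qwerty", "letmein"]  # tiny stand-in for the real list
dataset = build_dataset(sample_common, rng)
```

The `random` module is used here only so that the sample is reproducible. For generating real credentials, Python's `secrets` module is the appropriate source of randomness.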
Training takes about 45 minutes on the <a href="https://www.nvidia.com/en-us/data-center/tesla-p100/" rel="noopener ugc nofollow" target="_blank">NVIDIA Tesla P100 GPU</a> <a href="https://www.kaggle.com/docs/efficient-gpu-usage" rel="noopener ugc nofollow" target="_blank">available to Kaggle users</a> and raises the overall accuracy on the test set from about 50% (roughly chance level for a balanced two-class dataset) to 99.4%. Spot checks on a few data samples also show reasonable model performance. <a href="https://medium.com/data-and-beyond/boosting-password-security-with-natural-language-understanding-building-a-simple-password-strength-34f52396b2fe">Visit Now</a>
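To make the 50% starting point concrete, here is a small sketch of how test-set accuracy is computed and why a model that has learned nothing scores about one half on balanced data. The `accuracy` helper, the toy labels, and the 0/1 label convention are assumptions for illustration only; they are not taken from the notebook.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Balanced toy test set: 0 = common password, 1 = random password (assumed convention).
labels = [0, 1] * 50
always_common = [0] * len(labels)  # a predictor that always answers "common"
baseline = accuracy(always_common, labels)
```

Because half of the toy labels are 0, the constant predictor is right exactly half of the time, so `baseline` is 0.5. An accuracy of 99.4% on the same kind of balanced split is therefore a large improvement over chance.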