AI & ML, Security

Custom NLP models for enhanced data security

We helped a security ISV develop a solution for detecting sensitive information in multilingual datasets. The project aimed to improve data protection and regulatory compliance by addressing the challenges of handling sensitive information in different languages, securing data used across sectors such as finance and government services.

Client overview

  • Our customer is the data security division of a global technology company that specializes in developing products and solutions for aerospace, defense, transportation and digital security. They are known for creating advanced systems such as avionics, cybersecurity solutions and defense technologies.

Challenge

  • Our client used regular expressions to detect sensitive data in text, but this approach did not scale because pattern matching is rigid. Context-sensitive data such as addresses and names was difficult to detect because the context varies between languages and geographies.

  • They needed a solution that would:

Replace regular-expression matching with an ML-based approach to detecting sensitive data.

Detect sensitive data across different languages, starting with English and Portuguese.
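
The limitation of rigid pattern matching described above can be illustrated with a minimal sketch. The two patterns below are hypothetical and chosen only for illustration; the client's actual expressions are not given here. Fixed-format items such as email addresses are caught, while names and street addresses, whose form depends on language and geography, are not.

```python
import re

# Hypothetical patterns for illustration only; not the client's real rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def regex_scan(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

sample_en = "Contact Maria Silva at maria@example.com."
sample_pt = "Maria Silva mora na Rua das Flores, 25, Lisboa."

# The email has a fixed shape, so a pattern finds it.
print(regex_scan(sample_en))
# The Portuguese name and street address have no fixed shape, so nothing is found.
print(regex_scan(sample_pt))
```

Covering names and addresses this way would require a growing set of language-specific patterns, which is the scaling problem a context-aware model is meant to avoid.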

Solution

We addressed the challenge by developing custom Named Entity Recognition (NER) models tailored to the client’s needs. Our approach included several key components:

Custom NER models

We proposed and built NER models specifically designed to handle multilingual data. These models used pretrained BERT embeddings to capture context and detect sensitive entities. The resulting detection service acted as middleware between the client applications and cloud data services.
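
Token-classification NER models commonly emit one tag per token in the BIO scheme (B- begins an entity, I- continues it, O is outside any entity). The tagging scheme used in this project is not stated, so the sketch below is an assumption, and the labels `PER` and `ADDR` are hypothetical. It shows how per-token predictions could be grouped into entity spans.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was open.
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag: close the open entity, skip the token.
            if current_label:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

# Hand-written tags standing in for model predictions.
tokens = ["Maria", "Silva", "mora", "na", "Rua", "das", "Flores", "25"]
tags = ["B-PER", "I-PER", "O", "O", "B-ADDR", "I-ADDR", "I-ADDR", "I-ADDR"]
print(bio_to_spans(tokens, tags))
```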

Fine-tuning for specific languages

The models were fine-tuned using carefully annotated datasets to detect sensitive information in English and Portuguese. This process involved preparing the data to accommodate the unique linguistic features of each language.
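
One data-preparation step that typically arises when fine-tuning BERT-style models on word-level annotations is aligning labels with subword tokens. The sketch below assumes a common convention: the first piece of each word keeps the word's label and the remaining pieces receive -100, the default `ignore_index` of PyTorch's cross-entropy loss. The `toy_wordpiece` splitter is a hypothetical stand-in for a real tokenizer, and nothing here is confirmed as the project's actual pipeline.

```python
IGNORE = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def toy_wordpiece(word, max_len=4):
    """Hypothetical splitter: cut a non-empty word into fixed-size pieces."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]

def align_labels(words, label_ids):
    """Expand word-level label ids to subword level."""
    tokens, aligned = [], []
    for word, label in zip(words, label_ids):
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        # Only the first piece carries the label; the rest are ignored by the loss.
        aligned.extend([label] + [IGNORE] * (len(pieces) - 1))
    return tokens, aligned

# Label ids are illustrative: 1 = B-ADDR, 2 = I-ADDR.
tokens, aligned = align_labels(["Rua", "Flores"], [1, 2])
print(tokens)
print(aligned)
```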

Scanning for sensitive entities

Our models were configured to scan documents for sensitive items, improving both the efficiency and the accuracy of the identification process, especially in domains that require compliance with regulations such as GDPR and HIPAA.
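
A document scan of this kind can be sketched as a loop that runs a detector over each document and tallies the entity labels found. The function names, the report shape, and the `stub_detect` lookup are all hypothetical; the stub only stands in for a trained NER model so that the example runs on its own.

```python
from collections import Counter

def scan_documents(documents, detect):
    """Run `detect` on each document and count entity labels per document.

    documents: dict mapping a document id to its text
    detect: callable taking text and returning (label, matched_text) pairs
    """
    report = {}
    for doc_id, text in documents.items():
        findings = detect(text)
        report[doc_id] = Counter(label for label, _ in findings)
    return report

def stub_detect(text):
    """Stand-in for a model: flags a few known strings."""
    known = {"Maria Silva": "PER", "Rua das Flores": "ADDR"}
    return [(label, phrase) for phrase, label in known.items() if phrase in text]

docs = {
    "a.txt": "Maria Silva mora na Rua das Flores.",
    "b.txt": "Nothing sensitive here.",
}
report = scan_documents(docs, stub_detect)
print(report)
```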

Scalability for future needs

While the initial solution focused on English and Portuguese, the architecture was designed to scale, so the client can extend support to additional languages as its needs evolve.
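
One way such extensibility could be structured is a registry that maps a language code to its detector, so that adding a language means registering one more function without touching the calling code. This is a sketch under that assumption, not the project's actual architecture; the registry, decorator, and stub detectors are hypothetical.

```python
DETECTORS = {}  # language code -> detector callable

def register(lang):
    """Decorator that registers a detector for a language code."""
    def decorator(fn):
        DETECTORS[lang] = fn
        return fn
    return decorator

@register("en")
def detect_en(text):
    # Stub standing in for the English model.
    return [("EN_STUB", text)]

@register("pt")
def detect_pt(text):
    # Stub standing in for the Portuguese model.
    return [("PT_STUB", text)]

def detect(text, lang):
    """Route text to the detector registered for `lang`."""
    try:
        detector = DETECTORS[lang]
    except KeyError:
        raise ValueError(f"no detector registered for language {lang!r}") from None
    return detector(text)

print(detect("Rua das Flores", "pt"))
```

Supporting another language later would then amount to one more `@register(...)` function backed by a model fine-tuned for that language.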

Results

Improved accuracy over the existing system

The NER models detected local addresses and other sensitive entities that the regular expressions had missed.

Compliance assurance

The client now meets stringent regulatory requirements, minimizing legal and reputational risks.

Conclusion

  • Our multilingual NER models enabled the client to strengthen data security across various sectors and to support compliance with regulatory requirements. The solution is designed to accommodate additional languages in the future, further strengthening the company’s data protection efforts.
