Watch demo here: https://youtu.be/ldUYTdizbVg
When you write an inquiry to RedactedGPT, the first thing it does is apply the Trend Micro Locality Sensitive Hash (TLSH) to your inquiry and compare that hash against the TLSH digests you have added to your database, checking for possible leakage. Note: when you add a document (instructions below), only the hash is saved, not the document. The document is deleted immediately, so even though the app lives entirely within your network, not even the app knows what's in the document. This is really important for security purposes.
If the inquiry is similar enough to a stored hash, the app will not send it to ChatGPT and will alert the user that it can't proceed because the inquiry appears to match information deemed confidential.
If the app determines that the inquiry is not similar to the hashes of your confidential documents, it then removes PII from the inquiry as an additional security control.
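A minimal stand-in for that step might look like this (the two regexes below are purely illustrative; the app's real patterns differ):

```python
import re

# Illustrative patterns only, not the app's actual PII regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def remove_pii(text: str) -> str:
    """Replace each PII match with a bracketed label so the inquiry stays readable."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the patterns in a dictionary makes it easy to bolt on more regexes later.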
Only then does it send the sanitized inquiry to ChatGPT via the API and return the answer to the user.
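The hand-off itself is a single API call. Here is a sketch using the legacy `openai` 0.x interface from the Twilio tutorial linked below (the package is imported lazily inside the function so the message-building helper works on its own; names are illustrative):

```python
def build_messages(sanitized_inquiry: str) -> list:
    """Wrap the scrubbed inquiry in the chat-completion message format."""
    return [{"role": "user", "content": sanitized_inquiry}]


def ask_chatgpt(sanitized_inquiry: str, api_key: str) -> str:
    import openai  # pip install openai (legacy 0.x interface)

    openai.api_key = api_key
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(sanitized_inquiry),
    )
    return resp["choices"][0]["message"]["content"]
```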
As an additional note: OpenAI doesn't store API inquiries for more than 30 days, and inquiries sent via the API are not used to retrain its models.
Add your API key to the .env inside the app folder and the scanner folder, and add your database credentials to the .env inside the _scanner folder as well as to the docker-compose.yml file.
For testing purposes I've inserted a fake username and password so that you can track them across all the files mentioned above. Please, please, please change the username and password to more secure ones; the fakes are only there to show you how it all fits together.
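For orientation, those .env files use the usual KEY=VALUE format. The variable names below are hypothetical placeholders, so check the sample files in the repo for the actual names:

```ini
OPENAI_API_KEY=your-api-key-here
DB_USER=change-me
DB_PASSWORD=change-me
```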
docker-compose build
(include the --no-cache at the end of the command if needed)
docker-compose up
(include the --force-recreate at the end of the command if needed)
Open a browser at http://0.0.0.0:8000 and enjoy!
We now have the capability to save a Locality Sensitive Hash for the confidential information you don't want your org to paste into ChatGPT. When someone makes an inquiry to RedactedGPT, the first thing it does is check whether the hash of the inquiry is similar to the hash of any of the documents you don't want leaked. If it is, the inquiry is not sent to ChatGPT and the user is informed.
To save the hash of a document, run the following command from your terminal:
curl -F "file=@your_file.docx" http://localhost:8002/upload
I'm using the API call from this tutorial: https://www.twilio.com/blog/integrate-chatgpt-api-python
I obtained the PII remover function from a ChatGPT prompt!
As you can probably tell I'm a huge fan of TLSH from Trend Micro. Here's the code: https://github.com/trendmicro/tlsh
- Build a separate module for the PII removal and import its functions into the Flask app; that way we can add more regexes more easily.
- Build a separate module for the hash functions.
- Add other document types to the hashing module.
- Improve the page to be responsive.
- Record partial hashes for documents, for example per page.
- Cluster the hashes in the table by distance and add a column for the cluster label. That way we don't have to compare in real time against all hashes; we can select a random sample from each cluster instead.
- Document how to set a good threshold.
- As of right now, the webapp container starts before the database table is created on first run; I need to fix that.
- Also, for newly added documents to be picked up, the app container currently has to be restarted. I'm still debating whether to reload the table with every inquiry or only every now and then; it will probably depend on how often a company plans to add files.
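The clustering idea in the to-do list above can be sketched independently of TLSH by treating the distance function as a parameter (a greedy, single-pass scheme; the radius and all names are illustrative, not a committed design):

```python
import random


def greedy_cluster(digests, distance, radius=50):
    """Assign each digest to the first cluster whose representative is within
    `radius`; otherwise start a new cluster. Returns {digest: cluster_label}."""
    reps, labels = [], {}
    for d in digests:
        for label, rep in enumerate(reps):
            if distance(d, rep) <= radius:
                labels[d] = label
                break
        else:  # no representative was close enough
            reps.append(d)
            labels[d] = len(reps) - 1
    return labels


def sample_per_cluster(labels, rng=random):
    """Pick one random member per cluster to compare against at query time."""
    members = {}
    for d, label in labels.items():
        members.setdefault(label, []).append(d)
    return [rng.choice(group) for group in members.values()]
```

With TLSH, `distance` would be `tlsh.diff` and the radius would be tuned alongside the blocking threshold.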