# Augment Kindly GPT with external data
If the web content you want Kindly GPT to scrape is in a private location on the internet (e.g. an intranet or an authenticated page), or has a non-standard file format (e.g. Confluence exports, Excel sheets, etc.), you can extract the data yourself and send it to Kindly using our external scraping integration API. This lets you augment the Kindly GPT knowledge base with URLs/content that is not accessible to our default scraper, which you usually set up in Connect > Kindly GPT.

To give an overview of how this works: you set up a webserver (a.k.a. the external scraper) that the Kindly server calls every day (a.k.a. the pull trigger). When called, the external scraper starts preparing the data and sends it to our data ingress endpoint whenever it is ready.

```mermaid
sequenceDiagram
    participant KindlyServer as Kindly server
    participant ExternalScraper as External scraper (webserver)
    participant DataIngress as Data ingress endpoint
    Note over ExternalScraper: Set up the external scraper webserver
    KindlyServer->>ExternalScraper: Daily call (trigger)
    ExternalScraper->>KindlyServer: Send OK response
    ExternalScraper->>ExternalScraper: Prepare data
    ExternalScraper->>DataIngress: Send data when ready
```

## Registering the external scraper

This is technically the final step of the process, but to make it easier to understand we have decided to cover it first.

From your workspace dashboard, click the Connect tab, then Kindly GPT. In the bar on the left you will see the External integration option; click Read more. In short: Connect > Kindly GPT > External integration > Read more.

*Screenshot: location of the External integration option.*

From this page you can add your external scrapers. Make sure you have Kindly GPT enabled.

*Screenshot: how to set up external scrape integrations.*

The URL you add here will now be called every day. The details of this API are explained in the following sections; your API will need to implement it exactly as described.

Each bot language has a separate external scraper integration setup. All the languages can point to the exact same URL, but each pull trigger only covers one language at a time. If you have multiple bots, they can use the exact same URL as well; we send bot identifiers with the pull trigger.

In the near future, we are planning to add a selection tool so you will be able to select which days of the week the scraping should take place.

## Details of the external scrape pull trigger

The Kindly server sends a request equivalent to the following curl command:

```bash
curl -X POST \
  -H "Kindly-HMAC: {{HMAC of the body}}" \
  -H "Kindly-HMAC-Algorithm: HMAC-SHA-256 (base64 encoded)" \
  -H "Kindly-Bot-Id: {{your bot id}}" \
  -H "Kindly-Bot-Language: {{bot language}}" \
  {{URL to your scraper}} \
  -d @<(cat <<EOF
{
  "bot_id": "{{your bot id}}",
  "lang": "{{bot language}}",
  "scraper_type": "external_integration_scraper_config",
  "external_integration_url": "{{URL to your scraper}}",
  "run_id": "{{run id}}"
}
EOF
)
```

### Headers

- Kindly-HMAC is, as the name suggests, the HMAC of the JSON part of the request. It provides authentication and validation of the contents of the message.
- Kindly-HMAC-Algorithm indicates the algorithm we are using. It is always set to HMAC-SHA-256 (base64 encoded). This header exists so that, in the case of an algorithm change, we can announce the breaking change.
- Kindly-Bot-Id is the ID of your bot.
- Kindly-Bot-Language is the language of your bot. If your bot supports multiple languages, this will be only one of them. Each bot language has a separate external scraper integration setup; all the languages can point to the exact same URL, but each pull trigger only covers one language at a time.

You need to implement HMAC validation on your end. By design, the endpoints are open to the world wide web, and the HMAC is the only thing proving that the origin of the message is Kindly. HMAC is explained in detail in the next section.

### JSON body

As you might have noticed, most of the information in the JSON body duplicates the headers. This is done to provide some extensibility for the future.

- run_id is an identifier for each external scraping integration run. It is provided by us.
- external_integration_url is your scraper URL. It is needed for the next step.

This JSON body needs to be sent back exactly as received when you upload the scraped content, as explained in the upcoming sections.

## HMAC & Kindly's implementation of HMAC

If you have never worked with HMAC, we suggest you look at some external documentation on it:

- Okta has a good explanation of what HMAC is and how to use it.
- Wikipedia has very good technical details.

Even though the concept is a little complex, it is usually easy to implement. In Kindly we use HMAC-SHA-256 (base64 encoded). In short:

- Don't do any preprocessing, such as removing whitespace.
- Run the HMAC algorithm over the body as a whole, with SHA-256 hashing and the key from your bot.
- Base64-encode the HMAC of the body.
- If the HMAC you calculated and the HMAC in the header are the same, the message can be trusted and its contents have not been manipulated.

To find your bot's HMAC key, go to the workspace dashboard, then Settings > General > Security > Show key.

*Screenshot: how to get the HMAC key of your bot.*

You can copy this key and save it in your external scraper's secrets storage.

## Respond to the initial request

When you receive the pull trigger, you should respond with a generic HTTP status 200 response. Do not include the scraped content or anything else in the response body; this is not the correct place for it.
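The sketch below shows how a pull-trigger handler could look in Go. It is a minimal example under a few assumptions not spelled out above: the exact header casing (`Kindly-HMAC`), a hypothetical `HMAC_KEY` environment variable holding the key copied from the dashboard, and a hypothetical `/kindly-scraper` route. Adapt it to your own framework and configuration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"io"
	"log"
	"net/http"
	"os"
)

// verifyHMAC recomputes the base64-encoded HMAC-SHA-256 of the raw request body
// and compares it, in constant time, with the value from the Kindly-HMAC header.
func verifyHMAC(body []byte, received, key string) bool {
	mac := hmac.New(sha256.New, []byte(key))
	mac.Write(body) // no preprocessing: hash the body exactly as received
	expected := base64.StdEncoding.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(received))
}

// prepareAndUpload is a stub: convert your data to .md/.txt/.html files, zip them,
// and POST them to the data ingress endpoint (see the following sections).
func prepareAndUpload(triggerJSON []byte) {}

func pullTrigger(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	// HMAC_KEY is assumed to hold the key from Settings > General > Security > Show key.
	if !verifyHMAC(body, r.Header.Get("Kindly-HMAC"), os.Getenv("HMAC_KEY")) {
		http.Error(w, "invalid HMAC", http.StatusUnauthorized)
		return
	}
	w.WriteHeader(http.StatusOK) // plain 200 response, nothing in the body
	go prepareAndUpload(body)    // prepare the data asynchronously; this JSON is echoed back later
}

func main() {
	http.HandleFunc("/kindly-scraper", pullTrigger)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note the use of `hmac.Equal` rather than a plain string comparison, so the check runs in constant time, and that the raw body is kept around because it has to be echoed back unchanged in the upload step.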
## Preparing the scraped data

You then need to prepare a zip file with the following constraints:

- It should contain a collection of files at the root level.
- Files can only be raw Markdown (.md), raw text (.txt), or HTML. All other file types will be ignored.

We are planning to add support for more file types in the future, but it is better to convert your files in your own integration, so you can decide what the important content is and remove what should not be included. This is the part that depends most on what you want to include.

## Sending the scraped data

You will need to send a POST request to datastore.kindly.ai with the following constraints:

- The request type needs to be multipart/form-data.
- You need to attach the zip file to the file key of the form.
- The following headers are required: Kindly-Bot-Id, Kindly-Bot-Language, Kindly-HMAC.
- Kindly-HMAC is how we validate that the request is coming from your service. It follows the HMAC pattern described previously. It needs to use the whole request body in the HMAC process, so the contents of the files are validated as well.
- You need to put the JSON you got from the pull trigger, exactly as received, in the json key of the form.

The form structure allows us to add more functionality in the future; it is not currently used for anything else. If you need a specific use case, please contact our support and we can discuss providing those features.
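As a sketch of the upload step in Go, assuming the form keys file and json, the header names used earlier, and `https://datastore.kindly.ai` as the ingress URL (check your integration settings and the pull trigger for the exact endpoint), the request could be built like this:

```go
package main

import (
	"archive/zip"
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"mime/multipart"
	"net/http"
	"os"
)

// ingressURL is an assumption based on this guide; use the exact data ingress endpoint you were given.
const ingressURL = "https://datastore.kindly.ai"

// buildZip packs the converted pages (.md, .txt or .html) into a zip, all at the root level.
func buildZip(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range files {
		f, err := zw.Create(name) // no subdirectories: files live at the archive root
		if err != nil {
			return nil, err
		}
		if _, err := f.Write(content); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// upload sends the zip and the echoed pull-trigger JSON as multipart/form-data.
func upload(botID, lang string, triggerJSON, zipData []byte, hmacKey string) error {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	fw, err := mw.CreateFormFile("file", "content.zip")
	if err != nil {
		return err
	}
	if _, err := fw.Write(zipData); err != nil {
		return err
	}
	if err := mw.WriteField("json", string(triggerJSON)); err != nil { // JSON from the pull trigger, unchanged
		return err
	}
	if err := mw.Close(); err != nil {
		return err
	}

	// HMAC over the whole multipart body, so the file contents are covered as well.
	mac := hmac.New(sha256.New, []byte(hmacKey))
	mac.Write(body.Bytes())
	signature := base64.StdEncoding.EncodeToString(mac.Sum(nil))

	req, err := http.NewRequest(http.MethodPost, ingressURL, &body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", mw.FormDataContentType())
	req.Header.Set("Kindly-Bot-Id", botID)
	req.Header.Set("Kindly-Bot-Language", lang)
	req.Header.Set("Kindly-HMAC", signature)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("ingress responded with %s", resp.Status)
	}
	return nil
}

func main() {
	zipData, err := buildZip(map[string][]byte{"faq.md": []byte("# FAQ\n...")})
	if err != nil {
		panic(err)
	}
	// triggerJSON is the body received from the pull trigger, stored earlier.
	triggerJSON := []byte(`{"bot_id":"...","lang":"en","run_id":"..."}`)
	if err := upload("your-bot-id", "en", triggerJSON, zipData, os.Getenv("HMAC_KEY")); err != nil {
		panic(err)
	}
}
```

The important detail is that the multipart body is fully buffered before signing, so the HMAC covers the entire request body including the zipped files, as required above.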
## Troubleshooting

### Irregular updates

If your data is not updated every day, or is updated on an irregular schedule, you can simply reply with a 200 to the pull trigger and skip uploading to the ingress endpoint.

## Examples

### External scraper (Go)

We have an example repository here, written in Go (you can use AI to convert this example to any language of your liking). You will need to add your own logic to convert your data into the acceptable file formats.

## Advanced: push mode

Rather than replying to the pull trigger, you can use the push mode of the API to achieve similar results without creating a webserver. This can create some security problems on your side, so be aware and follow the instructions to the letter.

### Setup

You still need to enter a URL in the external scraper settings. Be sure you own the domain and the URL you are adding. We use the URL as the identifier for the external scrapers, so these need to be unique URLs. If you enter something like example.com, a domain you don't own, people can hijack the domain and try to impersonate your system. HMAC will prevent this exploit, but if your HMAC key gets stolen this becomes an attack vector.

### Ignoring the pull trigger

We will still try to send a pull trigger to the URL you have entered. If there is no webserver at the associated address, that is fine.

### Pushing

You need to replicate the JSON you would get from the pull trigger. This is easy to do, but there are some nuances:

- run_id needs to be in the format "%Y-%m-%d-%H%M%S-%f". Details can be found in the Python strftime documentation (you don't need to use Python). The language code also needs to be in the exact format we are expecting.
- We suggest setting up a webhook testing tool, such as Webhook.site, adding it in the setup to receive an example pull trigger, and then using its contents to build your push system. See the sketch after this list for generating the run_id yourself.
- Our trigger happens near midnight Europe/Oslo time.
- Don't forget to change run_id every time you push.
- Pushes have a hard limit of 5 uploads per day. Please don't try to bypass this, because it can cause unexpected answers in Kindly GPT responses.
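As a sketch of generating a run_id from Go, under the assumption that "%Y-%m-%d-%H%M%S-%f" is the literal layout (hyphen-separated, with %f being six-digit microseconds as in Python), you could do something like:

```go
package main

import (
	"fmt"
	"time"
)

// newRunID builds a run_id equivalent to Python's
// datetime.now().strftime("%Y-%m-%d-%H%M%S-%f"):
// four-digit year, zero-padded month/day, HHMMSS, and six-digit microseconds.
func newRunID(t time.Time) string {
	return fmt.Sprintf("%s-%06d", t.Format("2006-01-02-150405"), t.Nanosecond()/1000)
}

func main() {
	// The pull trigger fires near midnight Europe/Oslo time, so using that zone
	// keeps pushed run_ids roughly aligned with pulled ones.
	loc, err := time.LoadLocation("Europe/Oslo")
	if err != nil {
		panic(err)
	}
	fmt.Println(newRunID(time.Now().In(loc))) // e.g. 2024-05-31-235901-042317
}
```

With a fresh run_id, you replicate the pull-trigger JSON (matching what you captured with the webhook testing tool) and send it to the ingress endpoint using the same multipart request shown earlier.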