Re: Asking permission to scrape OpenStack docs for external search

29 Jan 2025

On Wed, Jan 29, 2025, at 12:01 PM, Jay Faulkner wrote:
...
Hello OpenDev admins,
Mehdi, one of our developer relations engineers here at GR-OSS, and I 
were talking about the state of OpenStack docs, and how a repeated 
complaint we hear is that search is not extremely useful. He noted that 
for https://consuldot.net GR-OSS uses a service called 
https://algolia.com to provide search (top right corner of 
consuldot.net).
We were hoping to pilot this for OpenStack and see how well the search 
worked against our documentation, but when going through the process 
they specifically request you have permission from the website owners 
before they crawl ( https://docsearch.algolia.com/apply/ ) -- so I'm 
reaching out to you to try and get this permission.
In general, we don't have many problems with search index crawler bots. They tend to identify themselves clearly via user agent strings and make sequential requests over time. Bots associated with AI systems tend to be far less well behaved, flooding systems with requests and re-requesting the same data over and over again (have they never heard of HTTP HEAD requests?).

As long as this system is well behaved (identifying itself, respecting robots.txt, being careful about redundant requests, and avoiding a flood of requests), I don't expect it to be a problem. It's not like anyone else crawling the Internet right now is asking permission. I expect that their asking at all indicates they are careful in how they run the crawling too.
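
For illustration, here is a minimal Python sketch of the "well behaved" checks listed above, using only the standard library. It is not DocSearch's or Algolia's actual crawler, and the user agent string, the sample robots.txt, and the helper names are all hypothetical. It uses conditional request headers (`If-None-Match` / `If-Modified-Since`) as one way to avoid re-downloading unchanged pages; a HEAD request is another.

```python
import time
import urllib.robotparser

# Hypothetical user agent: a clear name plus a contact URL so admins can identify the bot.
USER_AGENT = "example-docs-crawler/0.1 (+https://example.org/crawler-info)"


def load_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been fetched once."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


class Throttle:
    """Enforce a minimum interval between sequential requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Sample robots.txt content, made up for this sketch.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = load_robots(SAMPLE_ROBOTS)
# Fall back to a 1 second delay (an arbitrary choice) if no Crawl-delay is given.
delay = robots.crawl_delay(USER_AGENT) or 1
throttle = Throttle(delay)
```

Before each fetch, such a crawler would check `robots.can_fetch(USER_AGENT, url)`, call `throttle.wait()`, and send the headers from `conditional_headers(...)` with whatever `ETag` or `Last-Modified` value it stored from the previous response for that URL.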

Worst case, we end up temporarily blocking them if something goes wrong, and we start over after learning more about the demands and expectations involved.

And to be clear, this is permission for crawling. What is done with the content would have to abide by the license terms OpenStack applies to that content (I think it is a mix of Apache 2.0 and some Creative Commons variants, but I'm not sure). It isn't clear to me what sort of permission DocSearch/Algolia are asking for.
...
Right now our ONLY goal is to get a prototype working for this to demo 
to the larger OpenStack community -- honestly I just want to see if 
it's worth the value before we have the larger discussion around 
externally-hosted search. So basically, we're looking for permission to 
point their crawler at OpenStack documentation (we may start with just 
Ironic for simplicity). We'll create a larger discussion including the 
foundation, TC, and larger community if it appears to be a worthwhile 
pursuit.
Let me know what you think!
I personally run searches like `site:docs.openstack.org ironic boot from volume` and find that this works pretty well across major search platforms. One option may be to just drop the built-in search and rely on global search indexes, restricted to the relevant site with a `site:` filter.
...
Thanks,
Jay Faulkner

Clark Boylan