On Wed, Jan 29, 2025, at 12:01 PM, Jay Faulkner wrote:
Hello OpenDev admins,
Mehdi, one of our developer relations engineers here at GR-OSS, and I were talking about the state of OpenStack docs, and how a repeated complaint we hear is that search is not extremely useful. He noted that for https://consuldot.net GR-OSS uses a service called https://algolia.com to provide search (top right corner of consuldot.net).
We were hoping to pilot this for OpenStack and see how well the search worked against our documentation, but when going through the process they specifically request you have permission from the website owners before they crawl ( https://docsearch.algolia.com/apply/ ) -- so I'm reaching out to you to try and get this permission.
In general we don't have many problems with search index crawler bots. They tend to identify themselves clearly via user agent strings and make sequential requests over time. Bots associated with AI systems tend to be far less well behaved, flooding systems with requests and re-requesting the same data over and over again (have they never heard of HTTP HEAD requests?). As long as this system is well behaved (identifying itself, respecting robots.txt, being careful about redundant requests, and avoiding a flood of requests) I don't expect it to be a problem. It's not like anyone else crawling the Internet right now is asking permission. I expect that this system being careful about asking indicates they are careful in how they run the crawling too. Worst case, we end up temporarily blocking them if something goes wrong, and we have to start over after learning more about the demands and expectations involved.

And to be clear, this is permission for crawling. What is done with the content would have to abide by the license terms OpenStack applies to that content (I think it is a mix of Apache 2 and some Creative Commons variants, but I'm not sure). It isn't clear to me what sort of permission docsearch/algolia are asking for.
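For illustration, the well-behaved crawler traits listed above (a clear user agent, respecting robots.txt, avoiding redundant downloads, and pacing requests) could be sketched roughly as follows in Python. This is only a sketch under stated assumptions, not how the Algolia crawler actually works: the user agent string, the example.org URLs, and the sample robots.txt rules are all hypothetical, and conditional GET headers are used here as one way to avoid re-fetching unchanged pages (the HEAD request mentioned above is another).

```python
import urllib.request
import urllib.robotparser

# Hypothetical user agent; a real crawler would publish its own name and contact URL.
USER_AGENT = "example-docs-crawler/0.1 (+https://example.org/bot-info)"


def load_robots(robots_txt):
    """Parse robots.txt content that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url, etag=None, last_modified=None):
    """Build a conditional GET so an unchanged page can be answered with 304."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def crawl_delay(parser, default=1.0):
    """Seconds to wait between requests: robots.txt Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    # Hypothetical robots.txt rules, for demonstration only.
    robots = load_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
    for url in ("https://docs.example.org/ironic/latest/",
                "https://docs.example.org/private/notes.html"):
        if robots.can_fetch(USER_AGENT, url):
            req = build_request(url)
            print("would fetch", req.full_url, "then wait", crawl_delay(robots), "s")
        else:
            print("skipping (disallowed by robots.txt)", url)
```

The sketch only builds requests and never opens a connection; a real crawler would send them one at a time, sleep for the crawl delay between them, and store each response's ETag and Last-Modified values for the next pass.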
Right now our ONLY goal is to get a prototype working for this to demo to the larger OpenStack community -- honestly I just want to see if it's worth pursuing before we have the larger discussion around externally-hosted search. So basically, we're looking for permission to point their crawler at OpenStack documentation (we may start with just Ironic for simplicity). We'll start a larger discussion including the foundation, TC, and larger community if it appears to be a worthwhile pursuit.
Let me know what you think!
I personally run searches like `site:docs.openstack.org ironic boot from volume` and find that this works pretty well across major search platforms. One option may be to just drop the built-in search and rely on global search indexes, with the query limited to the docs site.
Thanks, Jay Faulkner