Hi Jeremy,

Thank you for your quick answer and for your support; we will be more than happy to share our research with you once it is published!

For the purpose of this study, we need to collect the whole OpenDev dataset. Right now, we have collected data from 2020 back to 2015 and still need the other years. As requested, we will add more waiting time to our process to let the server catch up. If you have any other suggestions that would help, we will be happy to try them.

We look forward to your answer.

Kind regards,
Moataz

________________________________
From: Jeremy Stanley
Sent: Tuesday, November 3, 2020 10:42 AM
To: service-discuss@lists.opendev.org
Cc: Chouchen, Moataz; Ouni, Ali; Laurin, François
Subject: Re: Opendev review crawler

On 2020-11-03 17:43:53 +0000 (+0000), Chouchen, Moataz wrote:
I am Moataz Chouchen, a PhD student at ÉTS Montréal under the supervision of Professor Ali Ouni. I am writing to explain our requests for OpenDev data, in response to your message regarding OpenDev code review data crawling.
Thanks so much for getting back to us! I'm sorry we ended up blocking access to your various systems, but we were at a loss for how to otherwise get in contact with whoever was running them since the queries were being performed anonymously (at least until we finally saw similar API requests originating from your university).
In fact, my PhD research consists of the identification, localization, and understanding of Modern Code Review (MCR) problems (also called anti-patterns). For this purpose, we need to crawl code review data from different projects (including OpenDev), since this data will help us study the MCR process in these projects and their associated anti-patterns. Specifically, we need the whole OpenDev dataset in order to apply data analysis methods (including code review metric distributions, social graph analysis, etc.) and study it in more depth. For this reason, we created a script to crawl data from OpenDev and use it in our empirical study on MCR anti-patterns in OpenDev.
This sounds really interesting, and I'm sure this mailing list would also love to receive a link to your research once you've completed and published it. I'm thrilled you have an interest in looking into these topics, especially with regards to the usage patterns and trends for a service we're operating.
I would like to thank you for your interest in understanding the issues behind our requests and your intention to resolve the matter for both sides. I am looking forward to collaborating with you, and I would like to know whether there are any rules I should follow when requesting data from your servers. I would be more than happy if you could help me crawl the data and provide access for me. I am also at your disposal for further clarifications regarding the requests. I hope to hear from you soon.
Mainly what I think we need to know is whether you've already collected the data you need from our Gerrit API, or how much you still need to query. The server wants to cache requests in memory for the sake of efficiency, but as you're aware we've got quite a lot of data. When queries for older data consume available memory on the server it begins to spend a lot more time trying to aggressively garbage collect cache contents, and that degrades its ability to serve requests for you and for other users.

If you still have a lot left to query, you might try to introduce a bit of a pause between each batch of paginated results to give the server some time to catch up freeing memory for caching other requests. Also, if obtaining the data in a different form would be useful to you, we might be able to look into performing a dump of specific database tables so you don't have to slowly trickle it out of the API.

Please do reply to the service-discuss mailing list at your earliest convenience and let us know how we can better help you with your research. Thanks again!
--
Jeremy Stanley
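[Archive note: the pagination-with-pause approach Jeremy describes could be sketched roughly as below. This is an illustrative sketch, not OpenDev's or the researchers' actual script: the `fetch` callback is a hypothetical caller-supplied helper that would issue the HTTP GET, and the batch size and pause length are placeholder values. The `)]}'` prefix and `_more_changes` flag are standard Gerrit REST API behavior.]

```python
import json
import time


def strip_xssi_prefix(body: str) -> str:
    """Gerrit prefixes JSON responses with )]}' to prevent XSSI attacks."""
    return body[4:] if body.startswith(")]}'") else body


def crawl_changes(fetch, batch_size=100, pause_seconds=2.0):
    """Page through Gerrit /changes/ results, sleeping between batches.

    `fetch(offset, limit)` is a caller-supplied function (hypothetical
    helper) returning the raw response body for one page of results.
    """
    offset = 0
    changes = []
    while True:
        body = strip_xssi_prefix(fetch(offset, batch_size))
        batch = json.loads(body)
        changes.extend(batch)
        # Gerrit sets _more_changes on the last entry when pages remain.
        if not batch or not batch[-1].get("_more_changes"):
            break
        offset += len(batch)
        time.sleep(pause_seconds)  # give the server time to free cache memory
    return changes
```

The pause between pages is the key point: it trades crawl speed for server health, which is exactly the accommodation requested in the thread.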
On 2020-11-04 00:02:33 +0000 (+0000), Chouchen, Moataz wrote: [...]
For the purpose of this study, we need to collect the whole OpenDev dataset. Right now, we have collected data from 2020 back to 2015 and still need the other years. As requested, we will add more waiting time to our process to let the server catch up. If you have any other suggestions that would help, we will be happy to try them. [...]
Just adding that delay/throttle will probably suffice, but also Ian's suggestion to set a custom user agent string with contact info on your API requests can help avoid confusion and make it easier for site administrators to reach you if there's a problem with your queries. I've gone ahead and removed the firewall rules we had temporarily blocking your IP addresses as well.

If you have any questions, please don't hesitate to get in touch through our service-discuss mailing list. Good luck on the research too, I can't wait to read your findings!
--
Jeremy Stanley
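[Archive note: setting a contact-bearing user agent, as suggested above, might look like the following Python sketch. The crawler name and email address are made-up placeholders to be replaced with the researchers' real contact details.]

```python
import urllib.request

# Hypothetical identifier; substitute your project name and a reachable address.
USER_AGENT = "ETS-MCR-research-crawler/1.0 (contact: researcher@example.org)"


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies the crawler to site admins."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

With this in place, the request shows up in server logs under a recognizable name instead of a generic library default, so administrators can email the operator rather than resorting to a firewall block.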