Opendev review crawler

Jeremy Stanley fungi at yuggoth.org
Tue Nov 3 18:42:15 UTC 2020


On 2020-11-03 17:43:53 +0000 (+0000), Chouchen, Moataz wrote:
> I am Moataz Chouchen, a PhD student at ÉTS Montréal under the
> supervision of Professor Ali Ouni. I am sending this email to
> explain the purpose of our requests for Opendev data, in response
> to your questions regarding our Opendev code review data crawling.

Thanks so much for getting back to us! I'm sorry we ended up
blocking access from your various systems, but we were at a loss
for how else to get in contact with whoever was running them, since
the queries were being performed anonymously (at least until we
finally saw similar API requests originating from your university).

> In fact, my PhD research focuses on the identification,
> localization, and understanding of Modern Code Review (MCR)
> problems (also called antipatterns). For this purpose, we need to
> crawl code review data from different projects (including
> Opendev), since this data will help us study the MCR process in
> these projects and their associated antipatterns. Specifically,
> we need the complete Opendev dataset so that we can apply data
> analysis methods to it (including code review metric
> distributions, social graph analysis, etc.) and study it in more
> depth. For this reason, we created a script to crawl data from
> Opendev for use in our empirical study on MCR antipatterns in
> Opendev.

This sounds really interesting, and I'm sure this mailing list would
also love to receive a link to your research once you've completed
and published it. I'm thrilled you have an interest in looking into
these topics, especially with regard to the usage patterns and
trends for a service we're operating.

> I would like to thank you for your interest in understanding the
> issues behind our requests and your intention to solve the issue
> for both sides. I am looking forward to collaborating with you,
> and I would like to know if there are any rules that I should
> follow when requesting data from your servers. I would be more
> than happy if you could help me crawl the data and provide access
> for me. I am also at your disposal for further clarifications
> regarding the requests. I hope to hear from you soon.

Mainly what I think we need to know is whether you've already
collected the data you need from our Gerrit API, or how much you
still need to query. The server caches query results in memory for
the sake of efficiency, but as you're aware we've got quite a lot
of data. When queries for older data consume the available memory
on the server, it starts spending much more time aggressively
garbage-collecting cache contents, and that degrades its ability
to serve requests for you and for other users.

If you still have a lot left to query, you might try to introduce a
bit of a pause between each batch of paginated results to give the
server some time to catch up freeing memory for caching other
requests. Also if obtaining the data in a different form would be
useful to you, we might be able to look into performing a dump of
specific database tables so you don't have to slowly trickle it out
of the API.
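
To illustrate the pacing idea, here's a rough Python sketch of the
kind of loop I have in mind, using the requests library against the
standard Gerrit REST change query endpoint (the query string, page
size, and pause interval are just placeholder values you'd adjust
for your study):

import json
import time

import requests

GERRIT_URL = "https://review.opendev.org"
PAGE_SIZE = 100      # changes per request
PAUSE_SECONDS = 5    # pause between pages; tune as needed


def fetch_changes(query, page_size=PAGE_SIZE, pause=PAUSE_SECONDS):
    """Yield change records matching `query`, one page at a time."""
    start = 0
    while True:
        resp = requests.get(
            f"{GERRIT_URL}/changes/",
            params={"q": query, "n": page_size, "S": start},
            timeout=60,
        )
        resp.raise_for_status()
        # Gerrit prefixes its JSON responses with ")]}'" on a line
        # of its own to defeat XSSI; strip that before parsing.
        page = json.loads(resp.text.split("\n", 1)[1])
        yield from page
        # The last record in a page carries _more_changes when
        # another page of results exists.
        if not page or not page[-1].get("_more_changes"):
            break
        start += len(page)
        time.sleep(pause)  # give the server time to free cache memory


if __name__ == "__main__":
    for change in fetch_changes("status:merged after:2020-01-01"):
        print(change["_number"], change["subject"])

Pacing the requests this way spreads out the load, and it's much
easier on the server than hammering the API with back-to-back
queries for historical data.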

Please do reply to the service-discuss mailing list at your earliest
convenience and let us know how we can better help you with your
research. Thanks again!
-- 
Jeremy Stanley