Opendev review crawler

Wed Nov 4 00:02:33 UTC 2020

Hi Jeremy,

Thank you for your quick answer and for your support. We will be more than happy to share our research with you once it is published!

For the purposes of our study, we need to collect the whole of the Opendev data. So far, we have collected the data from 2020 back to 2015 and still need the earlier years. As requested, we will add more waiting time to our process to let the server catch up. If you have any other suggestions that would help, we will be happy to try them.

We look forward to hearing from you.

Kind regards,
Moataz

________________________________
From: Jeremy Stanley
Sent: Tuesday, November 3, 2020 10:42 AM
To: service-discuss at lists.opendev.org
Cc: Chouchen, Moataz; Ouni, Ali; Laurin, François
Subject: Re: Opendev review crawler

On 2020-11-03 17:43:53 +0000 (+0000), Chouchen, Moataz wrote:
> I am Moataz Chouchen, a PhD student at ETS Montréal under the
> supervision of Professor Ali Ouni. I am sending this email to
> explain my requests for Opendev data, in response to your
> inquiry regarding Opendev code review data crawling.

Thanks so much for getting back to us! I'm sorry we ended up
blocking access to your various systems, but we were at a loss for
how to otherwise get in contact with whoever was running them since
the queries were being performed anonymously (at least until we
finally saw similar API requests originating from your university).

> In fact, my PhD subject consists of the identification,
> localization, and understanding of Modern Code Review (MCR)
> problems (also called antipatterns). For this purpose, we need to
> crawl code review data from different projects (including Opendev)
> since this data will help us to study the MCR process in these
> projects and their associated antipatterns. Specifically, we need
> the whole of the Opendev data to apply data analysis methods to it
> (including code review metric distributions, social graph
> analysis, etc.) and study it in more depth. For this reason, we
> created a script to crawl data from Opendev and use it in our
> empirical study on MCR antipatterns in Opendev.

This sounds really interesting, and I'm sure this mailing list would
also love to receive a link to your research once you've completed
and published it. I'm thrilled you have an interest in looking into
these topics, especially with regards to the usage patterns and
trends for a service we're operating.

> I would like to thank you for your interest in understanding the
> issues behind our requests and your intention to resolve the issue
> for both sides. I am looking forward to collaborating with you. I
> would like to know if there are any rules that I should follow
> when requesting data from your servers, and I would be more than
> happy if you could help me crawl the data and provide access
> for me. I am also at your disposal for further clarifications
> regarding the requests. I hope to hear from you soon.

Mainly what I think we need to know is whether you've already
collected the data you need from our Gerrit API, or how much you
still need to query. The server wants to cache requests in memory
for the sake of efficiency, but as you're aware we've got quite a
lot of data. When queries for older data consume available memory on
the server it begins to spend a lot more time trying to aggressively
garbage collect cache contents, and that degrades its ability to
serve requests for you and for other users.

If you still have a lot left to query, you might try to introduce a
bit of a pause between each batch of paginated results to give the
server some time to catch up freeing memory for caching other
requests. Also if obtaining the data in a different form would be
useful to you, we might be able to look into performing a dump of
specific database tables so you don't have to slowly trickle it out
of the API.
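
As a rough illustration of the pause suggested above, here is a minimal
Python sketch of a paginated crawl that sleeps between batches. It is
only a sketch under assumptions, not a description of the actual crawler:
the host name, the page size, the five-second pause, and the helper names
fetch_page and crawl are all hypothetical. It relies on the Gerrit REST
API's /changes/ endpoint with its q, n and S parameters, the )]}' prefix
Gerrit puts in front of JSON responses, and the _more_changes flag on the
last entry of a page.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "https://review.opendev.org"  # assumed Gerrit host
XSSI_PREFIX = ")]}'"  # Gerrit prepends this line to JSON responses


def fetch_page(query, start, limit):
    """Fetch one page of changes from the Gerrit REST API (hypothetical helper)."""
    params = urllib.parse.urlencode({"q": query, "n": limit, "S": start})
    url = f"{BASE_URL}/changes/?{params}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        body = resp.read().decode("utf-8")
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def crawl(query, limit=100, pause_seconds=5.0, fetch=fetch_page, sleep=time.sleep):
    """Yield every change matching `query`, pausing between pages.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without touching the network; the defaults are illustrative only.
    """
    start = 0
    while True:
        page = fetch(query, start, limit)
        if not page:
            return
        yield from page
        # Gerrit marks the last entry of a page when more results exist.
        if not page[-1].get("_more_changes"):
            return
        start += len(page)
        # Give the server time to free memory before the next batch.
        sleep(pause_seconds)
```

Splitting the crawl into bounded queries, for example one call to crawl()
per month using Gerrit's after: and before: search operators, and raising
pause_seconds if responses start to slow down, should keep each batch
small. The right values are something to tune, not fixed recommendations.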

Please do reply to the service-discuss mailing list at your earliest
convenience and let us know how we can better help you with your
research. Thanks again!
--
Jeremy Stanley