Use imports as a means to associate developers and projects #65
I have written the feature extraction (for projects). I used it to extract graphs per language for all of the source{d} repositories. I will upload visualizations and a gist soon. I don't expect the feature extraction to be much different when extracting from specific commits. One thing to note is that, depending on the language, the XPath/attribute to provide in the query is not the same:
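For illustration, a minimal sketch of how such per-language queries might look with the v2 Python client. The query strings and the mapping below are assumptions standing in for the actual values referenced above, not the real per-driver queries:

```python
import bblfsh

# Hypothetical per-language queries - the real XPath/attribute pairs
# come from inspecting each driver's output, which is exactly the
# inconsistency described above.
IMPORT_QUERIES = {
    "Python": "//uast:RuntimeImport",
    "Go": "//uast:Import",
}

client = bblfsh.BblfshClient("0.0.0.0:9432")

def imports_for(path, language):
    # parse() returns a result context that can be filtered with XPath
    ctx = client.parse(path)
    return [node.get() for node in ctx.filter(IMPORT_QUERIES[language])]
```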
I did not check for other languages (yet). I talked quickly with Alex, and apparently the way to query will one day be universal, but that is not yet the case.
@creachadair When do you plan to converge to a single import query?
+ @dennwc for a reality check on what I've said here. We don't have a specific date for this feature yet, but imports are one of several structures we plan to normalize as part of the semantic (canonicalized) UAST. For now, it's obviously complicated because of the language differences. I think this (and several other structural normalizations) will need some schema additions and then a bunch of per-language driver updates. I believe the main tricky part is that UAST transforms are intended to be reversible. Right now that means we sometimes have to duplicate information when we transform (or it would be lossy), and that can cause other issues; bblfsh/sdk#339 is the path I think we want to follow.
Let me clarify this in detail. Sorry about the long post, but please take some time to read it carefully. First, the structure you see is already normalized with regard to imports. We may have some bugs in the normalization, but apart from that, it is already done according to a few strict rules that we have in the UASTv2 schema. Let me now answer some implicit questions from the above: Why there are
Now, a short practical part about how to deal with imports right now. I suggest first querying all imports. If I understood the intention correctly, you need to check the
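The inline snippets above did not survive formatting, so here is a rough sketch of that two-step approach (query every import node, then inspect its fields), assuming the v2 Python client. `//uast:Import` and the `Path` field follow the usual UASTv2 schema names, but treat both as assumptions rather than the exact ones from the original comment:

```python
import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")

def file_imports(filename):
    """Collect the import paths of one file.

    The query string and the "Path" field are assumptions based on
    the UASTv2 schema; the exact names from the comment were lost.
    """
    ctx = client.parse(filename)
    paths = []
    for node in ctx.filter("//uast:Import"):
        obj = node.get()  # load the node as a plain Python value
        if isinstance(obj, dict) and "Path" in obj:
            paths.append(obj["Path"])
    return paths
```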
@dennwc thank you very much for this detailed explanation; I learned a lot from it as well. We should remember to copy-paste this and edit it into our documentation one day.
@dennwc Thanks, this was insightful. Regarding #65 (comment): is it possible to add this function to the Python client, since we are going to use it extensively from different packages and do not really wish to support it ourselves? Getting such imports is indeed of great importance, and not only for this issue. Our ML tasks always require one of the two UAST normalizations:
It is funny that I still remember our discussion in Turing about UASTv2, and how @mcuadros complained about writing 50 LoC to extract imports in Go, and another 100 to extract imports in PHP. So having an easy way to extract imports was one of the major v2 drivers, at least initially. Back then, at the meeting, we decided to optimize for easy function and class retrieval - that's what everybody desired in the Gemini/Apollo context. We never had ML tasks that involved import extraction before, nor did we expect such for assisted code reviews, so they were left hanging in the air. I did check how identifiers, functions, and comments could be extracted in Semantic mode, and it worked reasonably well after some bug fixing and documentation mining.
@vmarkovtsev I think the Semantic UAST is still in line with that past discussion - the query is as simple as

At the same time, I have to admit there was always an internal contradiction in that design discussion. The aim was to provide a universal AST that is easy to use without losing any information, and we've done it. But we cannot simplify it any further without losing info. What I proposed at that time was to provide a very simple and stable API like

Sorry for the rant, but we cannot make "the right thing" given contradictory inputs from the tech management that oppose what other team leads think is essential moving forward.

Having said that, the issue with imports heavily affects ML and is urgent. My personal opinion is that we should stop forcing the "use UAST for everything" approach and instead provide a simple API (in addition to the UAST) to users in general and to ML specifically. It's also easier for LA, since we can hide all schema-related details and properly resolve all the types on our own. And it provides better backward compatibility, of course.

So I will start working on this API in the Python client. However, this goes against what was decided during the design discussion, so I'm linking @mcuadros and @creachadair to either support or override. In case of an override, I will transfer this piece of code to ML.
The problem with the function is that it may be impossible to use it in gitbase without writing wrappers... and that's something we are going to do as well. I've got an idea: what if we hack the XPath and add
@vmarkovtsev This can be done, but citing my comment above:
The code in XPath will be insanely hard to maintain if we try to hack it in a performant manner. We can do it in a "slow" way, but then all queries will run twice as slow, because we would do multiple passes over a single tree: one in normal mode and one in "easy" mode. In general, there is no way to find out whether a query expects an "easy" tree or not. Checking for

If only Gitbase is a problem, we can implement the same helper in Go. In fact, to implement it in the Python client we will first implement it in
There is another hack that we can do: add a post-processing step for the UAST that re-projects all the implicit fields from the schema. So

If functions are not an option, I would prefer this approach, since it's more in line with how UASTv3 will work.
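For concreteness, a minimal sketch of what such a re-projection pass could look like, assuming a dict-shaped UAST; the `_imported_path` field is a hypothetical stand-in for whatever the schema additions would actually define:

```python
def reproject_imports(node):
    """Recursively copy schema-implicit information onto explicit
    fields, so that naive queries see an "easy" tree.

    The "_imported_path" field name is hypothetical; a real
    re-projection would follow the agreed schema additions.
    """
    if isinstance(node, dict):
        if node.get("@type") == "uast:Import" and "Path" in node:
            node["_imported_path"] = node["Path"]
        for value in node.values():
            reproject_imports(value)
    elif isinstance(node, list):
        for child in node:
            reproject_imports(child)
    return node
```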
Personally, I would rather keep the current structure of the UAST without loss of information or speed. As Denis outlined in his first two comments, the splits between imports that exist make sense - they are simply missing from the Babelfish documentation, hence my initial hacky approach. I think the best idea would be to integrate it into an API as part of the Python client - I'm not sure it would make sense to include it anywhere else, for instance in Gitbase, as it does not seem to match the level of abstraction. But yeah, honestly, I don't know if it's that important, as we can simply integrate it into one of our repos. Currently, the way I do it is simply to retrieve all UASTs with
Update

As Vadim told me, a paper named import2vec was released at MSR 2019. As it pertains directly to this task, I am going to summarize it and give my thoughts on it here.

Idea: with the growing number of packages/modules available, navigating them becomes more and more problematic. Therefore, the authors thought it would be interesting to apply ML techniques (specifically, word embeddings from NLP) in order to create vector representations of libraries, which can be used in the context of similarity search (check out their demo website to see how it works, it's very cool). They did this on three large corpora of Python, Java, and JavaScript libraries.

Specifics on data analysis:
Specifics on training:
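The training details above were in a list that did not survive here, but in the spirit of the approach, here is a minimal sketch of the general technique: treat each project's import list as a "sentence" and train skip-gram embeddings with gensim. This is an approximation for intuition only, not the authors' actual objective (import2vec differs in the details), and the corpus below is made up:

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is the set of libraries imported by one
# project; these examples stand in for the real crawled dataset.
corpus = [
    ["numpy", "scipy", "pandas"],
    ["numpy", "tensorflow", "keras"],
    ["requests", "flask", "sqlalchemy"],
]

# Skip-gram (sg=1) over import co-occurrence; a large window makes
# every import in a project a context for every other one, which is
# the core intuition behind embedding imports.
model = Word2Vec(corpus, vector_size=100, window=100, sg=1, min_count=1)
print(model.wv.most_similar("numpy", topn=3))
```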
Conclusion: Very good paper, gives a lot of insight into the task - it should be added to the Awesome list. I will definitely compare their model with embeddings created with Swivel - the models can be found on Nokia's GitHub. Unfortunately, they only released the embedding vectors, not the dataset of used projects or even the processed data. They did release scripts to crawl GitHub and extract imports, but I guess we don't really need those.
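One way such a model comparison could look is nearest-neighbor overlap between the two embedding sets. A sketch under the assumption that both models are exposed as `{library: vector}` dicts (a format chosen here for illustration):

```python
import numpy as np

def neighbor_overlap(emb_a, emb_b, k=10):
    """Fraction of shared top-k cosine neighbors between two
    {library: vector} embedding dicts, averaged over the shared
    vocabulary - a crude proxy for how much two models agree."""
    vocab = sorted(set(emb_a) & set(emb_b))

    def topk(emb):
        mat = np.array([emb[w] for w in vocab], dtype=float)
        mat /= np.linalg.norm(mat, axis=1, keepdims=True)
        sims = mat @ mat.T
        np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
        return np.argsort(-sims, axis=1)[:, :k]

    na, nb = topk(emb_a), topk(emb_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(na, nb)]
    return float(np.mean(overlaps))
```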
Update

Okay, so as of now I am collecting a dataset of about 10k repos per language in order to create the embeddings. We have talked with Vadim about how to evaluate the quality of the embeddings, in order to compare models and select the best one. Since the end goal is project similarity and we do not have a dataset of similar libraries, this is a tricky question. We have come up with three ways to compare different aspects of the embeddings, and will use them as a proxy:
Once we select a model through these proxies, the final stage will be to create embeddings for, say, 1000 repos, then manually review the results. This will be painful, but it will come closest to evaluating the computed similarity.
Blocked until #74 is done, unless I use only a subset of PGA or somehow hack the system.
Update

#74 is done, so this will be resumed at the end of this week, once the ClickHouse DB has been recreated. Next steps will be:
Projects: different things to test. The first is to simply sum the embeddings of all imports in the project; the second is to do the same at file level, then average those for the project (potentially with weights). A sketch of both options follows below.

Devs: same here, there is no clear way, but I think the best is to retrieve all commits of a dev, then create embeddings either from the imports in the files after the commit, or from the imports that the dev modified himself. The second method has been somewhat tested with success in the MSR paper that sparked this task, but in a completely different context. The first method assumes that when modifying a file, devs understand all the logic in the file (which may be a bit much?). Finally, we could also just measure the ownership devs have over projects, then use that together with the embeddings previously created for projects.

Language: often projects are multi-language, so we should test creating embeddings over all languages. This is heavily linked to how we create the co-occurrence matrix, as it's not clear whether imports from distinct languages should inform each other or not.

Evaluation: already commented on this, but basically:
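As promised in the "Projects" paragraph, a sketch of the two aggregation options, assuming per-import vectors are already available; the input shapes are chosen here for illustration:

```python
import numpy as np

def project_embedding_sum(import_vectors):
    """Option 1: sum the vectors of all imports in the project.
    `import_vectors` is an iterable of per-import vectors."""
    return np.sum(np.asarray(list(import_vectors)), axis=0)

def project_embedding_per_file(files, weights=None):
    """Option 2: sum per file, then (optionally weighted) average the
    file-level vectors. `files` maps file path -> list of vectors."""
    file_vecs = np.array([np.sum(vecs, axis=0) for vecs in files.values()])
    return np.average(file_vecs, axis=0, weights=weights)
```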
I'm gonna update the checklist to reflect this comment; if you have any comments, @vmarkovtsev or @m09, please tell me.
As discussed in the PR of the last RML, I'm closing this issue - it's become way too long and has changed too much over time to still be tracking anything. I'll link the new issue that replaces this one.
Context
As seen in this paper, a way to identify experts in a project is to look at their commit history - more precisely, to look at features involving the imports of said project (in other projects). The idea is that if developers import a library a lot, then they have expertise in that library.
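For intuition, a toy sketch of that idea - counting per-developer import frequencies from commit history. The input format is made up for illustration and is not the paper's feature set:

```python
from collections import Counter, defaultdict

def expertise_scores(commits):
    """Toy expertise proxy: count how often each developer touches
    each library across their commits. `commits` is an iterable of
    (developer, [imports seen in the changed files]) pairs - a
    hypothetical input format chosen for this sketch."""
    scores = defaultdict(Counter)
    for dev, imports in commits:
        scores[dev].update(imports)
    return scores
```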
Task
Create import graphs from a project/set of projects/commit/commits of a developer.
Checklist