
Use imports as a means to associate developers and projects #65

Closed
16 tasks
r0mainK opened this issue May 14, 2019 · 17 comments

r0mainK commented May 14, 2019

Context

As seen in this paper, a way to identify experts in a project is to look at developers' commit histories - more precisely, at features involving imports of said project in other projects. The idea is that if a developer imports a library a lot, then they have expertise in that library.

Task

Create import graphs from a project, a set of projects, a commit, or the commits of a developer.

Checklist

  • Retrieve all imports from the ClickHouse DB containing what we could parse of PGA
  • Create co-occurrence matrix of imports
    • For different granularities of imports (library, package)
    • For different granularities of co-occ (file level, subdir level, project level)
  • Create embeddings for imports with Swivel
  • Code evaluation pipelines
    • SGNS approach
    • Robustness approach
    • Select small qualitative dataset, e.g. src-d repos
  • Create and evaluate project embeddings
    • For project level co-occ
    • For file level co-occ, aggregating embeddings with different weights
  • Create and evaluate dev embeddings
    • Using commits directly
    • Using project embeddings by proxy (ownership measure?)
  • Evaluate Dev-Project similarity
r0mainK self-assigned this May 14, 2019

r0mainK commented May 20, 2019

I have written the feature extraction (for projects). I used it to extract graphs per language for all of the source{d} repositories. Will upload visualizations and a gist soon. I don't expect the feature extraction to be much different when extracting from specific commits. One thing to note is that, depending on the language, the XPath/attribute to provide in the query differs:

  • Python: //uast:RuntimeImport/Path, then extract attribute Name
  • Ruby: //uast:RuntimeImport/Path, then extract attribute Value
  • Go and JavaScript: //uast:Import/Path, then extract attribute Value
  • C#: //uast:Import/Path, then extract attribute Name
  • Java: a bit annoying, as the identifiers of each import get separated. Basically, I extracted the attribute Name with //uast:Import//uast:Identifier, then did the same with //uast:Import//uast:Identifier[1] to know which subarray corresponded to each import.

I did not check other languages (yet). I talked quickly with Alex, and apparently the way to query will one day be universal, but that is not yet the case.
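
For reference, here is a minimal sketch of the per-language extraction, assuming the bblfsh Python client's parse/filter API (the file names and endpoint are placeholders):

import bblfsh

# (XPath query, attribute holding the import path) per language, as listed above
QUERIES = {
    "example.py": ("//uast:RuntimeImport/Path", "Name"),
    "example.rb": ("//uast:RuntimeImport/Path", "Value"),
    "example.go": ("//uast:Import/Path", "Value"),
    "example.js": ("//uast:Import/Path", "Value"),
    "Example.cs": ("//uast:Import/Path", "Name"),
}

client = bblfsh.BblfshClient("localhost:9432")
for filename, (query, attr) in QUERIES.items():
    ctx = client.parse(filename)                # parse the file into a UAST context
    for node in ctx.filter(query):              # iterate over the matching Path nodes
        print(filename, node.get().get(attr))   # node.get() loads the node as a dict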

@vmarkovtsev

@creachadair When do you plan to converge to a single import query?


creachadair commented May 20, 2019

@creachadair When do you plan to converge to a single import query?

+ @dennwc for a reality check on what I've said here.

We don't have a specific date for this feature yet, but imports are one of several structures we plan to normalize as part of the semantic (canonicalized) UAST. For now, it's obviously complicated because of the language differences.

I think this (and several other structural normalizations) will need some schema additions and then a bunch of per-language driver updates.

I believe the main tricky part is that UAST transforms are intended to be reversible. Right now that means we sometimes have to duplicate information when we transform (or it would be lossy), and that can cause other issues; bblfsh/sdk#339 is the path I think we want to follow.


dennwc commented May 20, 2019

Let me clarify this in detail. Sorry about the long post, but please take some time to read it carefully.

First, the structure you see is already normalized with respect to imports. We may have some bugs in the normalization, but apart from that, it is done according to a few strict rules that we have in the UASTv2 schema.

Let me now answer some implicit questions from the above:

Why are there RuntimeImport and Import instead of a single Import?

Those imports have totally different semantics. If you analyze Python and find a RuntimeImport, you cannot be 100% sure whether the code actually imports that package or not.

This is because languages with RuntimeImport usually have an interpreter and dynamic import resolution, where the programmer may override import rules at runtime, or even make an import conditional by placing it inside an if statement.

For Import, however, it is known for sure that the code imports the specified package in 100% of cases. Most languages that support this have declarative import statements that are not affected by the execution flow.

And there is a third import type called InlineImport. It acts differently from the ones above and means that the imported file is inlined directly into the current file. There are no namespaces or packages for the import, just file concatenation. It's only used in C/C++ for now.

We think it's important to keep this distinction to know how each of those imports is applied. Generic static analysis would not be possible without it.

You may ask why we don't just add an IsRuntime flag to Import instead. When designing this, we decided that it would be too easy to miss such a flag and assume all imports are declarative. We wanted to clearly show that these are completely different import mechanisms, so analyzers have to explicitly choose whether to handle them as the same or not.

So if you decide to handle them as the same, use an XPath like //uast:Import | //uast:RuntimeImport | //uast:InlineImport. Fields like Path are the same for all import kinds.

Why is the query different across languages? Why do you sometimes need Name and sometimes Value to get the import path?

Imports are more complex than you may think. From the ML side the question is simple: "give me all the import paths". For us, however, it branches into a few additional questions:

  • Do you mean only filesystem imports or package imports? Or both?
  • Do you mean the real package name or the alias that this file uses? Or maybe both?
  • Do you care about the symbols being imported? They may be part of the "import path" definition (for example in C# and Java).

You need to answer each of those questions, and for different use cases the answers will, of course, differ.

Let me guide you through the examples that @r0mainK kindly provided.

Python

//uast:RuntimeImport/Path and then extract Name

First, I believe our transformation here is incorrect. I apologize for this; I should have examined the Python driver more closely. We will fix this ASAP.

If you examine the UAST you will see something like this:

RuntimeImport{
  Path: Identifier{
    Name: "..."
  }
}

This represents an import by package name; the query here is //uast:RuntimeImport/Path/uast:Identifier.

It is incorrect because sometimes the import is not a package name but a file path. For those cases it should be:

RuntimeImport{
  Path: String{
    Value: "..."
  }
}

This represents an import by file path; the query here is //uast:RuntimeImport/Path/uast:String.

This is important because we need a way to distinguish filesystem imports, which are usually easier to resolve, from package imports, which may be harder.

If you don't care about the distinction, use XPath with | the same way as I mentioned above.

Second, the query used here is not generic, because the import may be aliased. In that case, the structure is the following:

RuntimeImport{
  Path: Alias{
    Name: Identifier{ Name: "foo" }
    Node: Identifier{ Name: "some-package" }
  }
}

If you only need the import itself, use /Path/Node/Name or /Path/Node/Value (FS import); if you need the alias, use /Path/Name/Name.

Ruby

//uast:RuntimeImport/Path and then extract Value

The query here is correct, since Ruby uses filesystem imports. The more precise query is //uast:RuntimeImport/Path/uast:String, then extract Value.

Go, JavaScript

//uast:Import/Path and then extract Value

Mostly correct; both languages import by file path, so the query should be //uast:Import/Path/uast:String, then extract Value.

Having said that, you may be surprised that we say Go imports by file path instead of by package. This is still an open question, but for now we think of Go imports as file paths, since a Go import can always be used as a file path (a 1:1 mapping). We may consider changing it to QualifiedIdentifier (see below).

Also, the query is not generic and ignores a few of the questions I listed at the beginning. Similar to Python, you will have cases where the import is aliased.

C#

//uast:Import/Path and then extract Name

As you may remember from my explanation above, Name means that it's a package import, and the query may be more specific: //uast:Import/Path/uast:Identifier.

But in more complex projects you may notice that the query fails. Here is why.

In the example I emphasized the last part of the query, which asserts the type you expect: the final uast:Identifier in //uast:Import/Path/uast:Identifier. This query expects a single identifier as the import, which in C# corresponds to a using Foo import.

However, it's typical to see imports like using Foo.Bar in C#. The Foo.Bar part is not a uast:Identifier, it's a uast:QualifiedIdentifier, and it doesn't have a Name, it has an array of names (Names):

Import {
  Path: QualifiedIdentifier {
    Names: [
      Identifier{Name: Foo}
      Identifier{Name: Bar}
    ]
  }
}

So in this case you need to extract those Names and concatenate them however you like.
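
For illustration, assuming the Import node has been loaded as a Python dict (here called imp), the concatenation could look like this:

qid = imp["Path"]                                              # a uast:QualifiedIdentifier
full_path = ".".join(ident["Name"] for ident in qid["Names"])  # -> "Foo.Bar"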

The import path may also be aliased, as in Go/JS/other languages.

Java

//uast:Import//uast:Identifier then the same with //uast:Import//uast:Identifier[1]

@r0mainK This already sounds like a hack, right? This happens exactly because the complexity here is being neglected. No offense - I know we lack docs on Babelfish, so it's our fault that there is a misunderstanding here.

Basically, the case is exactly the same as in C# with QualifiedIdentifier. But in Java, imports like foo are not typical; they usually look like com.foo at least, so there are no cases with a single Identifier and all of them end up as QualifiedIdentifiers.

So the correct query here is //uast:Import/Path/uast:QualifiedIdentifier, then extract Names and concatenate them.

Also, I strongly suggest never using a [1] query unless you know what you are doing. The old UASTv1 model assumed a flat children list; that's not the case in UASTv2, so you always need to specify the field you are interested in. In your case, to remove the hack it would be //uast:Import/Path/uast:QualifiedIdentifier/Names[1], but as you may imagine, this ignores the rest of the path.

And again, the import path may be aliased.

Additional cases

Maybe not related to your current use case, but some imports will have Names and All fields in addition to Path. Those relate to the symbols being imported from the package. If All is true, the whole package is imported; if Names is specified, it will contain an array of the identifiers being imported. Both may also be empty - in that case it's likely a side-effect import.

Conclusions

So, answering @vmarkovtsev's question:

When do you plan to converge to a single import query?

The short answer is "when UASTv3 (graph) is done". We cannot preserve everything I mentioned above and, at the same time, blur specific details enough to allow a simple XPath for imports. But we can do this with graphs/DAGs, since we will be able to have "layers" with different abstraction levels in the same UAST.

So if this sounds important for ML, let's reach out to management and make sure UASTv3 is prioritized for the next quarter in our OKRs. I proposed doing so for UASTv2, but the proposal was rejected by management. Now both our teams have to deal with the outcome.

Also, I believe the next question is:

Can you abstract this complexity for us?

The answer is "maybe".

We can make some elaborate hacks in XPath and/or the clients, but all of those will be tied to this specific use case. Although I agree the case is important, we need to find some middle ground here.

The discussion about imports and their model in UASTv2 happened almost a year ago. I asked members of the ML team to check UASTv2 sooner rather than later to see whether the model was right for you. We got almost no feedback, so we modeled UASTv2 to be suitable for static analysis and to preserve as much information as possible. We also tried to make sure the semantic differences between imports (and the other types we support) are communicated to the user.

So now that the transition to Semantic and UASTv2 has started in ML, we will be happy to answer questions and help resolve any issues. But it's unlikely that we'll change the UASTv2 schema now.

And thanks for reading the whole post. I think the context here is important.


dennwc commented May 20, 2019

Now, a short practical part about how to deal with imports right now. I suggest first querying all imports with //uast:Import | //uast:RuntimeImport | //uast:InlineImport and then examining each object's Path field to extract the information you need. That will be easier than trying to handle all the cases in XPath.

If I understood the intention correctly, you need to check the Path field and consider 4 cases (based on the @type field), as sketched in the code after this list:

  • uast:String: file path import - extract the Value field as a string.
  • uast:Identifier: package import - extract the Name field as a string.
  • uast:QualifiedIdentifier: fully-qualified package import - iterate over the Names field, extract each item as a uast:Identifier (Name), join them with . or whatever delimiter you prefer.
  • uast:Alias: package name alias - extract the Node field and recurse on it (it can be one of the 3 types above).
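
A minimal sketch of that type switch, assuming the bblfsh Python client's parse/filter API (the file name and endpoint are placeholders):

import bblfsh

IMPORT_QUERY = "//uast:Import | //uast:RuntimeImport | //uast:InlineImport"

def extract_import(path_node):
    """Return the import path as a string, covering the 4 cases above."""
    node_type = path_node["@type"]
    if node_type == "uast:String":                # file path import
        return path_node["Value"]
    if node_type == "uast:Identifier":            # package import
        return path_node["Name"]
    if node_type == "uast:QualifiedIdentifier":   # fully-qualified package import
        return ".".join(ident["Name"] for ident in path_node["Names"])
    if node_type == "uast:Alias":                 # alias: recurse on the Node field
        return extract_import(path_node["Node"])
    raise ValueError("unexpected Path type: %s" % node_type)

client = bblfsh.BblfshClient("localhost:9432")
ctx = client.parse("example.py")
for node in ctx.filter(IMPORT_QUERY):
    imp = node.get()                              # load the import node as a dict
    print(extract_import(imp["Path"]))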


eiso commented May 21, 2019

@dennwc thank you very much for this detailed explanation, I learned a lot from it as well. We should remember to copy-paste this and edit it into our documentation one day.


vmarkovtsev commented May 21, 2019

@dennwc Thanks, this was insightful.

Regarding #65 (comment): is it possible to add this function to the Python client, since we are going to use it extensively from different packages and we do not really wish to maintain it ourselves? Getting such imports is indeed of great importance, and not only for this issue.

Our ML tasks always require one of the two UAST normalizations:

  1. Super generic. Lossy as hell. Impossible to reconstruct anything back. However, it gives common entities such as identifiers, functions, classes, imports, comments, etc. with a simple query that works for all languages. This is what we urgently need for the observability tasks, and what we used to work with before switching to assisted code reviews more than a year ago. That's how we thought "Semantic" worked.
  2. Reasonably abstract, but with traits for each language. Lossless: allows reconstruction of the original code byte for byte. Not suitable for large-scale analysis. Has unified roles for every token. This is more or less how "Annotated" works atm, and we used it while we actively worked on assisted code reviews during the previous year.

It is funny that I still remember our discussion in Turing about UASTv2, and how @mcuadros complained about writing 50 LoC to extract imports in Go, and another 100 to extract imports in PHP. So having an easy way to extract the imports was one of the major v2 drivers, at least initially.

Back then at the meeting, we decided to optimize for easy function and class retrieval - that's what everybody desired in the Gemini/Apollo context. We never had ML tasks that involved import extraction before, nor did we expect such tasks for assisted code reviews, so imports were left hanging in the air. I did check how identifiers, functions, and comments could be extracted in Semantic mode, and it worked reasonably well after some bugfixing and documentation mining.


dennwc commented May 21, 2019

@vmarkovtsev I think the Semantic UAST is still in line with that past discussion - the query is as simple as //uast:Import | //uast:RuntimeImport. The structure is the same for every language, and as I mentioned, it is a generic (non-lossy) way to represent most import features, including local imports and aliases. There is no longer 50-100 LoC per language; there is only one function with a type switch.

At the same time, I have to admit there was always an internal contradiction in that design discussion. The aim was to provide a universal AST that is easy to use without losing any information. And we've done it. But we cannot simplify it any further without losing info.

What I proposed at that time was to provide a very simple and stable API like getImports instead of XPath and the AST (as you mentioned). The proposal was rejected during that discussion, however; the decision was to only expose UAST and XPath. But now it seems we need to implement the simplified API anyway, and this is exactly the functionality the ML team needed all along.

Sorry for the rant, but we cannot do "the right thing" given contradictory inputs from tech management that oppose what other team leads think is essential moving forward.

Having said that, the issue with imports heavily affects ML and is urgent. My personal opinion is that we should stop forcing the "use UAST for everything" approach and instead provide a simple API (in addition to the UAST) to users in general and to ML specifically. It's also easier for LA, since we can hide all schema-related details and properly resolve all the types on our own. And provide better backward compatibility, of course.

So I will start working on this API in the Python client.

However, this goes against what was decided during the design discussion. I'm linking @mcuadros and @creachadair to either support or override it. In case of an override, I will transfer this piece of code to ML.


vmarkovtsev commented May 22, 2019

The problem with the function is that it may be impossible to use in gitbase without writing wrappers... That's something we are going to do as well. I've got an idea: what if we hack the XPath and add //uastX:EasyImports, //uastX:EasyIdentifiers, etc.? Those would internally call the functions.


dennwc commented May 22, 2019

@vmarkovtsev This can be done, but citing my comment above:

We can make some elaborate hacks in XPath and/or clients, but all of those will be tied to this specific use case.

The code in XPath will be insanely hard to maintain if we try to hack it in a performant manner.

We can do it in a "slow" way, but then all queries will run twice as slow, because we would do multiple passes over a single tree: one in normal mode and one in "easy" mode. In general, there is no way to find out whether a query expects an "easy" tree or not.

Checking for uastX: will help in some cases, but it will be quite confusing, since those nodes are always virtual and you won't see them in the UAST viewer. The nodes will disappear if the subquery does not include uastX:, so things like count and [0] will act weirdly.

If gitbase is the only problem, we can implement the same helper in Go. In fact, to implement it in the Python client we will first implement it in libuast (in Go).


dennwc commented May 22, 2019

There is another hack we could do: add a post-processing step for the UAST and re-project all the implicit fields from the schema. So uast:Import would have a fake gen:Path field with the concatenated import path, for example.

If functions are not an option, I would prefer this approach, since it's more in line with how UASTv3 will work.


r0mainK commented May 22, 2019

Personally, I would rather keep the current structure of the UAST, without loss of information or speed. As Denis outlined in his first two comments, the distinctions between import kinds make sense - they are simply missing from the Babelfish documentation, hence my initial hacky approach.

I think the best idea would be to integrate it into an API as part of the Python client - I'm not sure it would make sense to include it anywhere else, for instance in gitbase, as it does not seem to match that level of abstraction. But honestly, I don't know if it's that important, as we can simply integrate it into one of our repos. Currently, I retrieve all UASTs with //uast:Import | //uast:RuntimeImport | //uast:InlineImport using gitbase, then with the Python client I do what Denis described in his second comment: a switch on each node type. It is fast, and the logic takes < 50 LoC; I doubt it would be hard to maintain.

vmarkovtsev changed the title from "Use imports as a means to associate developers and projects" to "[research] Use imports as a means to associate developers and projects" May 23, 2019

r0mainK commented May 28, 2019

Update

As Vadim told me, a paper named import2vec was released at MSR 2019. As it pertains directly to this task, I'm going to summarize it and give my thoughts on it here.

Idea: with the growing number of packages/modules available, navigating them becomes more and more problematic. The authors therefore thought it would be interesting to apply ML techniques (specifically, word embeddings from NLP) to create vector representations of libraries, which can be used for similarity search (check out their demo website to see how it works, it's very cool). They did this on three large corpora of Python, Java, and JavaScript libraries.

Specifics on data analysis:

  • They trained on 3 languages, but analyzed data in 6, adding PHP, C#, and Ruby. When 2 projects had the exact same imports, they considered them clones and removed them. They analyzed a massive number of repos, from 200k to 600k, from both GitHub and each language's respective package manager.
  • They investigated ways of filtering relevant imports. Empirically, they show that one way to do so is to exclude imports reused in fewer than 10 projects - which they then did.
  • They created popularity classes by binning on the amount of reuse, and showed that for all languages this seems to follow Zipf's law. They infer that what they call classes 2 through 4, i.e. imports reused in 10 to 10k projects, are the relevant ones: below that range imports are anecdotal, as seen before, and above it they are too common (e.g. fmt in Go - I saw this intuitively when creating the graphs of imports for src-d projects). They coined class 2 (10 to 100 reuses) surprising imports, and consider it the class where recommendation is most useful.
  • They state that the granularity to use for imports depends on the task (logical); in their case they only looked at libraries (so they cleaned their data, i.e. os.path -> os). Until now I've removed specific imports of objects/classes but kept sub-modules within libraries; I don't know whether I should go as far as they did. Anyway, this reduced the number of unique libraries per language to 20k-60k.
  • @vmarkovtsev while not as fine-grained as you envisioned, they did compare co-occurrences when taking the file versus the project as context (too bad they didn't look at directory structure). They show that cohesion is much stronger at the file level and think it's functionally more interesting to look at that level. I definitely want to investigate this further.

Specifics on training:

  • They adapted the SkipGram model to create embeddings (see the sketch after this list). The architecture is the following: they feed a target and a context library to an embedding layer, which creates an embedding for both; the two embeddings are then multiplied (dot product) and passed through a sigmoid. The model has to guess whether the context library was imported alongside the target or not. They provided positive examples through their dataset, and negative ones by randomly sampling imports that do not co-occur in the dataset. They fed the same amount of both, and used binary cross-entropy as the loss.
  • This model seems much stronger than Mikolov's model, as 1. the order of imports has no importance, 2. they are sure the negative examples are true negatives, and 3. negative examples make up half of the data instead of being skewed towards negative samples.
  • They trained at class, module, and library granularity. The last two levels seem to drastically reduce the number of trivial clusters.
  • They experimented with embedding sizes from 10 to 600, and determined that 100 is good.
  • Evaluation was done empirically (visualization) as well as by devising a prediction task and comparing with random choices. The results show the model learned something, although it is not clear how well - one would need to compare with other models...
  • They briefly compared with word2vec and CBOW, and it seems SkipGram works best, although they only did a preliminary analysis.
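
For reference, a toy PyTorch sketch of the architecture as I understand it (the shared embedding layer is my assumption - the paper may keep separate target/context matrices; vocabulary size and dimension are illustrative):

import torch
import torch.nn as nn

class Import2Vec(nn.Module):
    def __init__(self, vocab_size=20000, dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, target_ids, context_ids):
        t = self.embedding(target_ids)             # (batch, dim)
        c = self.embedding(context_ids)            # (batch, dim)
        return torch.sigmoid((t * c).sum(dim=1))   # P(context co-occurs with target)

model = Import2Vec()
targets = torch.randint(0, 20000, (32,))
contexts = torch.randint(0, 20000, (32,))
labels = torch.randint(0, 2, (32,)).float()        # 1 = real pair, 0 = negative sample
loss = nn.BCELoss()(model(targets, contexts), labels)  # binary cross-entropy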

Conclusion:

Very good paper; it gives a lot of insight into the task and should be added to the Awesome list. I will definitely compare their model with embeddings created with Swivel - the models can be found on Nokia's GitHub. Unfortunately, they only released the embedding vectors, not the dataset of projects used or even the processed data. They did release scripts to crawl GitHub and extract imports, but I guess we don't really need those.


r0mainK commented May 30, 2019

Update

Okay, so as of now I am collecting a dataset of about 10k repos per language, in order to create the embeddings. Vadim and I have talked about how to evaluate the quality of the embeddings, in order to compare models and select the best one. Since the end goal is project similarity and we do not have a dataset of similar libraries, this is a tricky question. We have come up with three ways to compare different aspects of the embeddings, and will use them as proxies:

  • The first way is relatively similar to what was done in the import2vec paper: predict whether a library is imported in a project. In the paper they used the average of all imports in the project to create the project embedding; I think we should try weighted variants as well. This will allow us to verify that our model learned something, by comparing with random embeddings, and to compare models (embedding size, granularity, etc.).
  • The second way is, for a given set of projects, to compute embeddings and look at the resulting KNN graph. Then randomly remove imports from the projects, recompute the embeddings and KNN, and look at the variation in topology. If the embeddings are good, we intuitively expect the topology to be relatively robust (see the sketch after this list).
  • The third way, a variant of the previous one, would be to take projects and compute embeddings for them at different moments in their history. By applying KNN to the HEAD commit, we would expect to recover the order of commits.
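
To make the second proxy concrete, a rough sketch using scikit-learn (how the project vectors X_before/X_after are produced is left out; both are assumed to have the same row order):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(X_before, X_after, k=10):
    """Average overlap of each project's k-NN set before/after import removal."""
    idx_b = NearestNeighbors(n_neighbors=k + 1).fit(X_before).kneighbors(X_before)[1]
    idx_a = NearestNeighbors(n_neighbors=k + 1).fit(X_after).kneighbors(X_after)[1]
    # position 0 is the point itself, so compare neighbors 1..k
    overlaps = [len(set(b[1:]) & set(a[1:])) / k for b, a in zip(idx_b, idx_a)]
    return float(np.mean(overlaps))                # close to 1.0 = robust topology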

Once we have selected a model through these proxies, the final stage will be to create embeddings for, say, 1000 repos, then review the results manually. This will be painful, but it comes closest to evaluating the computed similarity.


r0mainK commented Jun 13, 2019

Blocked until #74 is done, unless I use only a subset of PGA, or somehow hack the system.

m09 added the research label Jun 18, 2019
m09 changed the title from "[research] Use imports as a means to associate developers and projects" to "Use imports as a means to associate developers and projects" Jun 18, 2019

r0mainK commented Sep 11, 2019

Update

#74 is done, so this will be resumed at the end of this week, once the ClickHouse DB has been recreated. Next steps will be:

  1. Extract all imports from the ClickHouse DB and push them to the cluster (requires some minimal scripting, as we lose a tiny bit of info with the flattened structure, but no biggie).
  2. Test different ways of creating the COOC (co-occurrence) matrix.
  3. Create the embedding matrices for imports.
  4. Test different ways of creating embeddings for projects and devs:

Projects

There are different things to test: the first is to simply sum the embeddings of all imports in the project; the second is to do the same at the file level, then average the file embeddings over the project (potentially with weights), as sketched below.
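
A toy sketch of both strategies (assuming emb maps an import name to its vector, e.g. a dict of numpy arrays from the Swivel output):

import numpy as np

def project_embedding_flat(project_imports, emb):
    """First strategy: sum the embeddings of all imports in the project."""
    return np.sum([emb[i] for i in project_imports], axis=0)

def project_embedding_by_file(files_imports, emb, weights=None):
    """Second strategy: embed each file, then (optionally weighted) average."""
    file_vecs = [np.sum([emb[i] for i in f], axis=0) for f in files_imports]
    return np.average(file_vecs, axis=0, weights=weights)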

Devs

Similarly, there is no clear way here, but I think the best approach is to retrieve all commits of a dev, then create embeddings either from the imports in the files after the commits, or from the imports the dev modified themselves. The second method has been tested with some success in the MSR paper that sparked this task, but in a completely different context. The first method assumes that when modifying a file, devs understand all the logic in it (which may be a bit much?). Finally, we could also just measure the ownership of devs over projects, then combine that with the project embeddings created previously (see the sketch below).
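
A toy sketch of the ownership-proxy variant (using the dev's share of commits as the ownership measure is only one possible choice; all inputs are hypothetical):

import numpy as np

def dev_embedding(commits_per_project, project_vecs):
    """Average project embeddings weighted by the dev's share of commits."""
    projects = list(commits_per_project)
    w = np.array([commits_per_project[p] for p in projects], dtype=float)
    w /= w.sum()
    return np.average([project_vecs[p] for p in projects], axis=0, weights=w)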

Language

Projects are often multi-language, so we should test creating embeddings over all languages. This is heavily linked to how we create the COOC matrix, as it's not clear whether imports from distinct languages should inform each other or not.

Evaluation

I already commented on this, but basically:

  • We will check that the information is kept with an approach similar to SkipGram Negative Sampling, by verifying that we can predict whether an import is/is not linked to a project/dev.
  • We will check the robustness of the created embeddings by clustering a given number of them, removing some imports, recreating the embeddings, and re-clustering to see the effects.
  • We will check qualitatively on e.g. src-d repos or other famous ones.

I'm gonna update the checklist to reflect this comment; if you have any comments, @vmarkovtsev or @m09, please tell me.


r0mainK commented Sep 26, 2019

As discussed in the PR of the last RML, I'm closing this issue - it's become way too long, and has changed too much over time to still be tracking anything. I'll link it from the new issue that replaces it.

r0mainK closed this as completed Sep 26, 2019
r0mainK mentioned this issue Sep 26, 2019