Roles are almost always empty #212
@vmarkovtsev Can you please elaborate on what you propose exactly? If you want to find imports, you don't need roles in Semantic: as you may know, you can just search for a node type. The same goes for an import name - if you don't need the name as a string but rather any subtree that corresponds to it, you can query for that subtree as well. But if you need to extract an import name as a string, things get more complicated regardless of the language, and the roles won't help you. I'm not sure why you refer to Java specifically, but maybe it's less typical for Java to use import aliases? It's not the case for Python, JS and a few others, however. So instead of a simple string import name, you will get two names - an alias. And we on the Babelfish side cannot know if you want one string (references resolution) or the other (dependency resolution). It would be basically the same mess with roles. Instead, we provide a helper that extracts the import name and concatenates it into a single string. For the Python client, for example, Konst had the same kind of function, but we can add it to the client itself. Would that solve your issue?
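For illustration, here is a minimal sketch of the type-based lookup described above. It assumes the Semantic UAST has already been decoded into plain Python dicts and lists with "@type" keys; it does not use the real client API, and the function names are made up:

```python
# Minimal sketch: finding nodes by Semantic type without using roles.
# Assumption: the UAST is available as nested dicts/lists with "@type" keys;
# no real client API is used here.

def iter_nodes(node):
    """Yield every dict-shaped node in a UAST, depth-first."""
    if isinstance(node, dict):
        yield node
        for child in node.values():
            yield from iter_nodes(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_nodes(child)

def nodes_of_type(uast, type_name):
    """Return all nodes whose "@type" matches, e.g. "uast:RuntimeImport"."""
    return [n for n in iter_nodes(uast) if n.get("@type") == type_name]

# Usage (hypothetical): imports = nodes_of_type(uast, "uast:RuntimeImport")
```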
I propose adding roles to the Python driver output, just like it already happens with Java. They are much easier to work with than doing alchemy with parent types, etc. That alchemy is not consistent across languages and is structure-dependent, so I end up splitting the logic between different languages, which kills many benefits of UASTs for me. On the other hand, doing a linear scan over the roles always worked before, and I don't see the reasons why it shouldn't for 99.(9)% of our mining use-cases.
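As a rough sketch of the linear role scan described here (the flat node shape and role names are illustrative, not the actual protocol encoding, which uses numeric role ids):

```python
# Sketch of a linear role scan over already-flattened nodes: no tree
# structure needed, just a flat pass filtering on role membership.
# Role names are plain strings here purely for readability.

def scan_by_role(nodes, wanted):
    """Return nodes that carry at least one of the wanted roles."""
    wanted = set(wanted)
    return [n for n in nodes if wanted & set(n.get("roles", []))]

flat_nodes = [
    {"type": "ImportDeclaration", "roles": ["Import", "Declaration"], "value": "os"},
    {"type": "FunctionDef", "roles": ["Function", "Declaration"], "value": "main"},
]
print(scan_by_role(flat_nodes, ["Import"]))  # -> only the import node
```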
Yeah, but for some languages, it is. Hence the Java driver's output is way better for me than Python's.
The helpers would not help here, because I am currently working on a project that requires unified data conversion for all the languages supported by Babelfish and serializing the results for unknown future tasks, e.g. semantic code search.
As you see, the problem is not with imports specifically. It is quite broad and covers all the entities we want to handle:
This is either not true or is a severe bug. Can you please give a link to a UAST in Semantic mode or a source file that produced this?
Right, for this specific case we can add roles. The intention, however, is to have a schema that describes it instead.
Again, this cannot happen in Semantic. You either work with a mix of Annotated and Semantic, or you found a bug where we don't convert a native AST subtree properly to Semantic UAST. Property names must always be the same in Semantic for the same type of a node (even derived ones).
The type of an entity is a
You mean the token, right? Or the unescaped/normalized value (e.g. import name, string literal value, etc)? Because those are different and have different assumptions.
You can't rely on roles for semantic code search. We need to talk more about this specific use case. Specifically: why the current structure is not good enough. Assuming there is no "zoo" with
We need to have a proper call to discuss this. Specifically, this list mixes semantic concepts (e.g. functions) and syntactic concepts (e.g. keywords), and I wonder why you need a mix of those. Not saying that it's useless, but a good description of the use case may help us a lot.
I keep hitting these bugs over and over again. I don't have the bandwidth to file an issue in each particular case, because their count is huge. May I suggest that you extract 10,000 random files from, say, PGA.v2 in each language, then put all the dictionary keys in a set and compare the sets with each other? Actually, the UASTs are already extracted, they are in
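A possible shape for that comparison, assuming the extracted UASTs are stored as one JSON document per source file under per-language directories (the layout and file glob are hypothetical):

```python
# Sketch of the sanity check suggested above: collect every dictionary key
# that appears in the parsed UASTs of each language and diff the key sets.
# The "uasts/<lang>/*.json" layout is an assumption for illustration.
import json
from pathlib import Path

def key_set(uast):
    """Gather all dict keys that occur anywhere in one UAST."""
    keys, stack = set(), [uast]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            keys.update(node.keys())
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return keys

keys_per_lang = {}
for lang_dir in Path("uasts").iterdir():  # e.g. uasts/python, uasts/java, ...
    keys = set()
    for path in lang_dir.glob("*.json"):
        keys |= key_set(json.loads(path.read_text()))
    keys_per_lang[lang_dir.name] = keys

common = set.intersection(*keys_per_lang.values())
for lang, keys in sorted(keys_per_lang.items()):
    print(lang, "extra keys:", sorted(keys - common))
```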
👍 Thanks. However, my issue is about having roles in Python at all. Currently, the only place where I see the roles is the root node TBH.
Nope. Example:
My current goal is reaching exactly the same structure for all the languages, exactly the same types, roles, etc., for all my simplistic targets. I don't want to specify the language when I query for imports, for example. I want to scan over the roles and get the results. I want to be lossy; none of my goals involve the backwards transform. "Destructive normalization": deleting everything which cannot be expressed in a standard way, deleting outlying language features, reaching a state at which all the languages look exactly the same. We do have another task with reconstructing the code in style-analyzer, but it has always been very language-dependent anyway. We used Annotated for that task and it worked.
I mean everything at once: any text or number associated with the node. Imports consist of several parts; in my ideal world, the parts are concatenated with a ".". Even if the language uses "/" or any other way to join the parts together, the normalized strings would be dotted. But that's my ideal world; currently I am happy to get the raw token value without the quotes, which is language-dependent. Comments would be left-deindented and stripped on the left and right. Literals would be stripped of quotes. Integers may be left unnormalized, just as they appear in the code. Identifiers would also be merged with a dot, ideally. In C++ you may have "::", in other languages other separators exist. Normalizing everything would be so damn great, even if it becomes harder to distinguish what is a namespace and what is a class name. I believe there can be workarounds with ignorable, precise tree augmentation.
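A small sketch of that kind of value normalization; the node shape and field names are assumed purely for illustration, only the rules (dotted qualified names, unquoted literals, de-indented comments, untouched integers) follow the comment above:

```python
# Sketch of the "destructive" value normalization described above.
# The node dicts below are a made-up shape, not a real driver output.
import textwrap

def normalize_value(node):
    t = node["type"]
    if t in ("import", "qualified-identifier"):
        # e.g. ["os", "path"] -> "os.path", regardless of the language's separator
        return ".".join(node["parts"])
    if t == "string-literal":
        return node["token"].strip("'\"")          # drop surrounding quotes
    if t == "comment":
        return textwrap.dedent(node["token"]).strip()  # de-indent and trim
    return node["token"]                            # integers etc. stay as written

print(normalize_value({"type": "import", "parts": ["os", "path"]}))     # os.path
print(normalize_value({"type": "string-literal", "token": '"hello"'}))  # hello
```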
In 99% of the use-cases I can, and they would work perfectly. I am talking about simple, short queries. I built the schema and I tried to use it, and it would work perfectly if the roles were assigned everywhere. That being said, I found the workaround with tracking nested types.
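A sketch of that workaround, assuming dict-shaped Semantic nodes and the type names mentioned in this thread:

```python
# Sketch of the "tracking nested types" workaround: walk the tree while
# remembering the chain of ancestor "@type" values, so a query such as
# "Identifiers somewhere under a RuntimeImport" needs no roles at all.
# The node shape and type names are assumptions based on this thread.

def walk(node, ancestors=()):
    if isinstance(node, dict):
        chain = ancestors + (node.get("@type"),)
        yield node, chain
        for child in node.values():
            yield from walk(child, chain)
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, ancestors)

def identifiers_under(uast, outer_type):
    """All uast:Identifier nodes having outer_type among their ancestors."""
    return [n for n, chain in walk(uast)
            if n.get("@type") == "uast:Identifier" and outer_type in chain[:-1]]

# Usage (hypothetical): names = identifiers_under(uast, "uast:RuntimeImport")
```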
Yep, that's on my list!
Thanks for the elaborate reply, it really helps with the context of the issue!
Hmm, it is surprising that it has ~none. We will double-check it.
Thanks, I now understand the use case you have. Yes, we don't even support that kind of normalization currently. I think it's worth adding a new transformation target for it. We definitely can do that in a short-mid term, but we need to discuss the details. Devil is always in the details.
We are not that far away from it, actually :) Yes, we currently may omit some nodes because we don't have a schema for them in Semantic. But, assuming the destructive normalization, we may ignore those inconsistencies and force the transformation anyway. Most of the normalization you mentioned (comments, string literals, qualified identifiers) is already performed in all drivers. So defining a new transformation target should solve this.
It would be great to see the schema as well during the call - this way we will know what is missing from our own schema or where we need a bit more abstraction.
I guess it depends on what you mean by semantic code search. But tracking nested types is definitely a better approach than roles. At least we can guarantee that those types won't collide the way roles do. It may still not be ideal (e.g. missing some information). If you think about alternatives, the schema provided by Semantic (assuming destructive normalization) is effectively the same as roles.
Well, not really 0. Alex came to me 3 hours ago, confused, and proved that some nodes do have roles 😄 Imports and functions don't. E.g. the fizzbuzz example on play.bblf.sh.
That's no secret anyway:

```sql
CREATE TABLE uasts (
id Int32,
repo String,
lang String,
file String,
line Int32,
parents Array(Int32),
pkey String,
roles Array(Int16),
type String,
uptypes Array(String),
value String
) ENGINE = MergeTree() ORDER BY (repo, file, id);
```
I throw away all the nodes which have an empty
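For context, here is a sketch of how nodes might be flattened into rows matching the table above. The column semantics are guessed from their names (parents/uptypes as ancestor ids and types, pkey as the key the node sits under in its parent), and the "@..." keys are assumptions about the v2 JSON encoding, so treat this as an illustration rather than the actual exporter:

```python
# Sketch: flatten a dict-shaped UAST into rows matching the ClickHouse
# schema above. Column meanings and "@pos"/"@role"/"@token" key names are
# assumptions, not the real exporter's behavior.

def flatten(uast, repo, lang, path):
    rows, next_id = [], [0]

    def visit(node, parent_ids, parent_types, pkey):
        if isinstance(node, list):
            for child in node:
                visit(child, parent_ids, parent_types, pkey)
        elif isinstance(node, dict):
            next_id[0] += 1
            nid = next_id[0]
            rows.append({
                "id": nid, "repo": repo, "lang": lang, "file": path,
                "line": node.get("@pos", {}).get("start", {}).get("line", 0),
                "parents": parent_ids, "pkey": pkey,
                "roles": node.get("@role", []), "type": node.get("@type", ""),
                "uptypes": parent_types, "value": node.get("@token", ""),
            })
            for key, child in node.items():
                if not key.startswith("@"):  # descend only into child fields
                    visit(child, parent_ids + [nid],
                          parent_types + [node.get("@type", "")], key)

    visit(uast, [], [], "")
    return rows
```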
@vmarkovtsev Thanks for sharing the schema! To be honest, I'm a bit confused about some of the details. Also, is there a specific reason to copy the v1 XML-like structure?
Ah, I see. Thanks for clarifying it.
Thanks for posting the schema, that's really helpful for context.

One key thing I notice in this schema is that you don't really need (or want) the whole UAST, only certain parts of it. That's good, because efficiently querying a fully-squashed graph structure from a flat table requires knowing not just what parent/child relationships there are, but what their "meaning" is. Otherwise you have to resort to repeated scans to discover the nested structure. Once you know which relationships matter, you can denormalize those so that they don't need repeated scans to recover the substructure. The schema above pretty clearly relies on the fact that some tool has done this flattening ahead of time (I assume something written using the library or XPath or some combination?). (As an aside: this is the reason why generic graph databases are so complicated: making efficient queries over graph structures without a priori denormalization takes some cleverness, and it will ~never be as fast as a custom-designed flat store for a query shape we know in advance.)

Even with normalizations applied, UASTs do not have exactly the same structure across languages. So I think if we want to turn this prototype into a real product, we're going to need to be more explicit about which tree structures we want to flatten. I know, for example, that imports are interesting, and you've mentioned things like "call sites occurring in a function body". To make that reproducible across languages, we can't rely on naïvely grabbing parent-child relationships from the raw tree (even after transformations). In some cases we might be able to hack it with XPath queries, but we've all seen that XPath is very sensitive to small changes in structure.

I'd like to get away from having tools rely directly on the tree structure unless they really need nested syntax: for flat statistical aggregations like imports the answer is probably library functionality. Since this is experimental I'm not too worried about the brittleness of the XPath mapping yet. But I'd like to be clear that XPath isn't a sustainable way to build this for a real tool, unless we commit to never changing anything about the exact structure each driver emits. Some combination of https://github.com/src-d/feature-idea/issues/153 and library functions for specifically-defined denormalizations seems like the right path forward.
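As an illustration of such a specifically-defined denormalization (the type names, "name"/"callee" fields, and node shape are placeholders, not the exact Semantic schema):

```python
# Sketch of a named denormalization: instead of an ad-hoc XPath over the raw
# tree, one function walks the tree once and emits a flat
# (enclosing_function, callee) table for "call sites in a function body".
# Type and field names here are illustrative placeholders.

def calls_per_function(uast, func_type="Function", call_type="Call"):
    rows = []

    def visit(node, current_func):
        if isinstance(node, list):
            for child in node:
                visit(child, current_func)
        elif isinstance(node, dict):
            if node.get("@type") == func_type:
                current_func = node.get("name", "<anonymous>")
            elif node.get("@type") == call_type and current_func is not None:
                rows.append((current_func, node.get("callee", "")))
            for key, child in node.items():
                if not key.startswith("@"):
                    visit(child, current_func)

    visit(uast, None)
    return rows  # flat rows, ready to load into a table like the one above
```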
Most of the semantic nodes do not have roles assigned. This is very inconvenient for the majority of our mining tasks, because we do not want to write complex XPaths which take the structure into account. E.g. to get the imports, we need to take `uast:Identifier`-s which are `Name`-s under `uast:RuntimeImport`. We want it to work the way it does for Java.