add GAT layer #3

Open
rkurchin opened this issue Aug 24, 2020 · 6 comments

@rkurchin (Member)

I'd like to implement a graph attention mechanism a la this paper.
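
For reference (paraphrasing the paper from memory, so double-check against the original), the per-edge attention coefficients in eq. 3 and the subsequent aggregation are

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\,\top}[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_j]\big)\big)}{\sum_{k \in \mathcal{N}_i}\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\,\top}[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_k]\big)\big)}, \qquad \vec{h}_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i}\alpha_{ij}\,\mathbf{W}\vec{h}_j\Big)$$

with multi-head outputs concatenated (or averaged at the final layer).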

@rkurchin added the enhancement (New feature or request) label Aug 24, 2020
@rkurchin self-assigned this Aug 24, 2020
@rkurchin (Member Author)

Question: Is there a way to do this for input graphs with variable size (number of nodes)?

@rkurchin (Member Author) commented Aug 24, 2020

It seems that they do so in this paper (CODE HERE). I had trouble figuring this out (they also have the extra complication of the clustering step), but this seems to be a way to do it. My best summary of the procedure (a rough code sketch follows the list below):

NOTATION:

number of attention heads: h
number of input features per node: l
number of output features per node: m

LAYER PROPERTIES

a_l, a_r (the result of splitting the concatenated attention vector from eq. 3 of Velickovic et al. into two separate parts): [1 x h x m]
dense-layer weight matrix W: [l x mh]

LAYER INPUTS:

graph with n nodes
feature matrix feat_in: [n x l]

LAYER ACTIONS (my commentary in parentheses):

  1. apply dense layer: feat = feat_in x W: [n x l] x [l x mh] => [n x mh], reshape to [n x h x m] (this is the usual dealio)
  2. compute attention coefficients: element-wise multiply feat by a_l and by a_r, yielding e_l and e_r: [n x h x m] * [1 x h x m] => [n x h x m] (for an undirected graph, the left and right parts are the same, but there are these updates of graph.srcdata and graph.dstdata that I'm still puzzling through; I don't think they break the symmetry, though...)
  3. sum this along the final dimension and unsqueeze: [n x h x 1] for each of e_l and e_r (why...?)
  4. compute edge attention: the message on each edge is the sum of the source node's e_l and the destination node's e_r (again, for an undirected graph we've just doubled it)
  5. apply the (LeakyReLU) activation to these messages, then softmax over each node's incoming edges, still storing the results as messages on the edges (fine, basically just a normalization)
  6. pass messages along edges to each node, multiplying features (element-wise) by attention: [n x h x m] * [n x h x m], then sum along the first per-node dimension, which I think is h, so the end result is [n x m] as desired (so much element-wise multiplication...)
  7. apply output activation
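
To make the shapes concrete, here's a rough NumPy sketch of steps 1-7. This is my own scaffolding, not the DGL implementation: the function names (`gat_layer`, `leaky_relu`), the edge-list representation, the ReLU output activation, and the head-averaging at the end are all placeholder choices; only the tensor names/shapes (`feat_in`, `W`, `a_l`, `a_r`, `e_l`, `e_r`, n, h, l, m) follow the notation above.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(feat_in, W, a_l, a_r, edges):
    """Shape-level sketch of steps 1-7 above (single layer, dense NumPy, no autograd).

    feat_in : [n, l]    input node features
    W       : [l, h*m]  dense-layer weights
    a_l, a_r: [1, h, m] attention parameters (the two halves of the eq. 3 vector)
    edges   : list of (src, dst) pairs; an undirected graph gets both directions,
              and every node should have at least one incoming edge (e.g. a self-loop)
    """
    n = feat_in.shape[0]
    h, m = a_l.shape[1], a_l.shape[2]

    # 1. dense layer, then reshape: [n, l] @ [l, h*m] -> [n, h, m]
    feat = (feat_in @ W).reshape(n, h, m)

    # 2.-3. attention coefficients: element-wise multiply, sum over the feature dim -> [n, h, 1]
    e_l = (feat * a_l).sum(axis=-1, keepdims=True)
    e_r = (feat * a_r).sum(axis=-1, keepdims=True)

    # 4.-5. per-edge attention: activation of e_l[src] + e_r[dst],
    #       then softmax over each destination node's incoming edges
    src = np.array([s for s, _ in edges])
    dst = np.array([d for _, d in edges])
    e = np.exp(leaky_relu(e_l[src] + e_r[dst]))  # [n_edges, h, 1]
    denom = np.zeros((n, h, 1))
    np.add.at(denom, dst, e)                     # sum of exp() per destination node
    alpha = e / denom[dst]                       # normalized attention weight per edge

    # 6. message passing: weight source features by attention, sum into destination nodes
    out = np.zeros((n, h, m))
    np.add.at(out, dst, alpha * feat[src])

    # 7. output activation (placeholder: ReLU) and reduce over heads -> [n, m]
    return np.maximum(out.mean(axis=1), 0.0)

# tiny usage example: a 3-node path graph 0-1-2, with both directions per bond
rng = np.random.default_rng(0)
n, l, h, m = 3, 4, 2, 5
feat_in = rng.normal(size=(n, l))
W, a_l, a_r = rng.normal(size=(l, h * m)), rng.normal(size=(1, h, m)), rng.normal(size=(1, h, m))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(gat_layer(feat_in, W, a_l, a_r, edges).shape)  # (3, 5)
```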

What I wonder...

Does this end up being different from a convolution? If so, how?

Since the attention is trainable per-head and per-output-feature, what are we really doing? I think the message passing should somehow encode structure, since a node with more neighbors will have a larger sum of input stuff, but how would the attention tensors ever "learn" about that in order to weight things appropriately?

@rkurchin added the longterm (longer-term idea, likely won't be implemented in the near term) label and removed the enhancement (New feature or request) label Aug 31, 2020
@rkurchin (Member Author)

Summary/my takeaways from a chat with Shaojie about this just now:

  • can think of this as a fully-trainable, weighted convolution: how much prior information about graph structure/connectivity is assumed is totally up to the user
  • because of this, ALL positional information MUST be encoded in the feature vector (or imposed as strict conditions/masking on the attention weights; see the sketch after this list) if it is important at all
  • there also must be a consistent indexing scheme for nodes such that if the same structure with the same node labels is fed in, an identical set of index labels comes out
  • I don't think this is necessarily able to do exactly what I had envisioned "out-of-the-box," namely discerning patterns about different local structures (i.e. connectivity patterns) in a graph; or at least not without combining it with something else like EigenPooling or graph clustering and subsequent labeling of subgraphs/supernodes with information about their structure
  • NB that training of self-attention can be unstable/sensitive
  • definitely a powerful technique that could add a lot, but a lot of things would need to be updated about how graph-building and featurization are done, so probably not a short-term priority
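
On the masking point above: a standard way to impose structure as a hard constraint (my illustration, not anything from the papers or DGL code discussed here; `masked_softmax` and the dense [n x n] logit layout are assumptions for the sketch) is to set disallowed attention logits to -inf before the softmax:

```python
import numpy as np

def masked_softmax(logits, adj):
    """logits: [n, n] raw attention scores; adj: [n, n] boolean adjacency (True = allowed pair).
    Disallowed pairs get -inf, so they receive exactly zero attention weight.
    Assumes every row has at least one True entry (e.g. include self-loops)."""
    masked = np.where(adj, logits, -np.inf)
    ex = np.exp(masked - masked.max(axis=-1, keepdims=True))  # stabilized softmax
    return ex / ex.sum(axis=-1, keepdims=True)
```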

@rkurchin mentioned this issue Oct 30, 2020
@thazhemadam (Member)

Bump. Now that we've made some pretty significant changes w.r.t. how we do graph building and featurization, it's probably a good time to discuss how we could go about this?

@rkurchin (Member Author) commented Oct 4, 2021

I would be very psyched for this to be implemented. Realistically, I won't personally have the bandwidth in the super near future (i.e. the next ~4-6 weeks), but I'm certainly happy to discuss and/or sketch this out in more detail for whoever ends up taking a crack at it, be that future me, current/future you, or someone else entirely! :)

@jonathanmfung

Here is another paper: "Crystal graph attention networks for the prediction of stable materials"

Paper: doi:10.1126/sciadv.abi7948
Python Code: hyllios/CGAT
Data: materialscloud:2021.222

I believe the novelty here is "replacing the precise bond distances with embeddings of graph distances", but correct me if I'm wrong.
