WIP: Update network-mpi.c #133
base: master
Conversation
Dan's Version of network-mpi.c with queue
I've caught a few things; fixing them should resolve some issues.
core/network-mpi.c
Outdated
@@ -786,13 +860,31 @@ tw_net_statistics(tw_pe * me, tw_statistics * s)

      if(MPI_Reduce(&(s->s_net_events),
          &me->stats.s_net_events,
-         17,
+         16,
Not related to your current changes, but at some point you'll need to update your branch to pull back in the original code. I think you were working from an old version of this file, and tw_net_statistics has recently changed, so you'll need to make sure that it's up to date with ROSS master. I don't think you'll need to change this at all for what you're working on, so it should just look exactly like what ROSS master currently looks like.
Just to be clear, this comment means the full tw_net_statistics function needs to be reverted, not just this one line.
core/network-mpi.c
Outdated
      if(!(e = tw_event_grab(me)))
      {
          if(tw_gvt_inprogress(me))
-             tw_error(TW_LOC, "Out of events in GVT! Consider increasing --extramem");
+             tw_error(TW_LOC, "out of events in GVT!");
Same here as with the comment on tw_net_statistics below. This was a recent change, so you'll have to update your branch so that we don't eventually undo recent changes to ROSS master.
@@ -460,9 +533,7 @@ recv_finish(tw_pe *me, tw_event *e, char * buffer)
      /* Fast case, we are sending to our own PE and
       * there is no rollback caused by this send.
       */
-     start = tw_clock_read();
This line and lines 351 and 465 need to be added back in as well (same reason as tw_net_statistics).
core/network-mpi.c
Outdated
      unsigned int cur;
+     int front;   //add, front of queue
+     int coda;    //add, back of queue but back is already a variable somewhere
+     int sizeOfQ; //add, size of queue array
To be consistent, we use snake_case, not camelCase.
core/network-mpi.c
Outdated
+     int front;   //add, front of queue
+     int coda;    //add, back of queue but back is already a variable somewhere
+     int sizeOfQ; //add, size of queue array
+     int numInQ;  //add, number of elements in queue
Maybe change this to something like num_in_free_q? I originally thought it was supposed to be the number of active operations in the queue based on the name, but eventually I realized it's supposed to be the number of free elements in the queue. Being a little clearer in the name will probably help in the future if someone else needs to make changes here.
core/network-mpi.c
Outdated
      {
-         unsigned id = posted_recvs.cur;
+         int id = fr_q_dq(&posted_recvs);
This should probably be moved until after trying to pull the event using tw_event_grab. As it is currently, if there is no event to grab, it will mess up your buffer's free list.
core/network-mpi.c
Outdated
      tw_node *dest_node = NULL;

-     unsigned id = posted_sends.cur;
+     int id = fr_q_dq(&posted_sends); // fixed, grabs from front of queue, moves front up one element
I think this will need to be moved to just after if (!e) break; because if the event is null, it's going to mess up your free element list for posted_sends.
core/network-mpi.c
Outdated
@@ -207,7 +274,7 @@ tw_net_minimum(tw_pe *me)
          e = e->next;
      }

-     for (i = 0; i < posted_sends.cur; i++) {
+     for (i = 0; i < posted_sends.cur; i++) { //fix this line (?)
          e = posted_sends.event_list[i];
          if (m > e->recv_ts)
You're also going to need to add a check to see if e is NULL.
core/network-mpi.c
Outdated
@@ -101,7 +156,19 @@ init_q(struct act_q *q, const char *name)
      q->event_list = (tw_event **) tw_calloc(TW_LOC, name, sizeof(*q->event_list), n);
      q->req_list = (MPI_Request *) tw_calloc(TW_LOC, name, sizeof(*q->req_list), n);
      q->idx_list = (int *) tw_calloc(TW_LOC, name, sizeof(*q->idx_list), n);
      q->status_list = (MPI_Status *) tw_calloc(TW_LOC, name, sizeof(*q->status_list), n);
+     q->free_idx_list = (int *) tw_calloc(TW_LOC, name, sizeof(*q->idx_list), n);
I think you already caught that you have an issue with reading/writing out of bounds of this list.
Changed camelCase to snake_case
Variable names changed; id assigned after events are checked for NULL in sends; e NULL check added in tw_net_minimum
Added NULL check before return
Commented out line 257
Added lines tw_clock start;, start = tw_clock_read();, and dest_pe->stats.s_pq += tw_clock_read() - start;. Changed error message to be more explicit, like in the master. Edited tw_net_statistics to match master.
Exchanged incrementing in the queueing function for an addition outside of the loops (wherever the function was used).
Added delete line in insert function
Commented out entire recv_finish cancel if statement, within which is an else statement attached to the if (cancel != NULL) for insertion
Filter anti-messages version
rolled back to original copy of avl_tree.c
updated to segfaulting exception code
removed error for nonexistent events.
Exception code written and implemented; minor errors with AVL tree size with read_buffer size of 5000. Code is functional, but could use some optimization tweaking.
adds one AVL tree element after out of order event is detected to offset accessing tw_hash_remove() twice. Removed various debug statements in commented code.
core/network-mpi.c
Outdated
      late_recv_finish(me, e, NULL, q, n);
      #endif
      late_recv_finish(me, e, NULL, q, n);
      //might need an augmented version for ROSS_MEMORY?
No, don't worry about ROSS_MEMORY. It's being removed in another branch.
Added reset queue capability and additional new reset queue capability
@yaciud is this still a work in progress? Or is this branch in some final state that works for you?
I consider it a work in progress, but progress on it has stalled. In quick succession, the May update to ROSS happened and the clusters I had access to got wiped or had a malfunction, so my code became outdated and difficult to test. I believe I have versions that work with the current version of ROSS, but there are still remnants from the outdated MPI layer in there that I'd like to get rid of before considering the work in its final stage. By versions I mean that I have 3 different ways of solving the problem written and working. I'd like to optimize 2 of them a little more, but they all seem to be improvements over the original network layer. It would probably take less than 2 days to get everything ready if needed.
This is the network file that uses both methods of filling holes in the MPI request array: moving events on the far side of the array inward (send events) and keeping track of events that have arrived and overwriting them (recv events). The file still has ROSS_MEMORY and debugging statements, unfortunately.
So, it says queue, but I switched implementations to a stack. Still messy; it should be cleaned up. There is a small performance issue because the cur variable does not shrink, causing MPI and some loops dependent on cur to check more elements than they need to. I have a fix for it that is not currently implemented: at the end of test_q, iterate from cur backwards to the next cell of the array that is not null. We would also need to check whether we place events farther in the array than cur, and either throw the larger value out and continue with the next value, or set cur to that new value.
This is the version that fills holes in the MPI request array by taking events from the back of the array and placing them in spots where the requests have been satisfied.
Merged current files with my files.
Updated old code to more easily integrate with new code
Dan, I'm creating a pull request so I can add in comments/feedback to specific lines of your code.
If this merge represents a feature addition to ROSS, the following items must be completed before the branch will be merged:
Include a link to your blog post in the Pull Request.