Doing partial reduceByKey in Flow created in func init() #84

Open
jdelamar opened this issue Aug 21, 2017 · 3 comments
jdelamar commented Aug 21, 2017

Hello.

I apologize if I have missed something obvious. I am using glow to map and reduce time series, and I would like to do a reduce or reduceByKey on every time slice (for instance, a reduceByKey over all events received in the last minute).

Right now, I am setting the code up to be distributed and, following the tutorial, have put my flows in the func init() section so that they are instantiated statically (only once, the same way) on every node.

The data is coming from an unbounded stream (i.e., not from a finite file). So I have something like this:

func init() {
	mapRecordsToMetadata.
		Channel(mapRecordsToMetadataInput).
		Map(mapTimeSeriesToTimeSliceFunc).
		Map(mapConvertColumnValuesFunc).
		// ... some more maps and filters
		ReduceByKey(reduceByFlowKey).
		AddOutput(mapRecordsToMetadataOutput)
}

// letsDoIt uses mapRecordsToMetadata to map and reduce all events for a given key during a time slice.
func letsDoIt(streamedEvents chan []string) chan GroupedEventsByKeyChan {
	out := make(chan GroupedEventsByKeyChan)

	go func() {
		for evt := range streamedEvents {
			mapRecordsToMetadataInput <- evt
		}
	}()

	go func() {
		for evt := range mapRecordsToMetadataOutput {
			out <- evt
		}
	}()

	return out
}

I have simplified a bit, but hopefully this is enough to get the idea. Now, ReduceByKey blocks until I close the mapRecordsToMetadataInput channel (which makes sense). However, if I do that, I can't really use my flow mapRecordsToMetadata anymore (is there a way to replace the input channel and restart it?).

Conceptually, I would "close" my input flow (mapRecordsToMetadataInput) at every time slice where I want the aggregate to run (i.e., every 30 seconds), so that my reduceByKey would run on that interval of inputs.

My only option seems to be keeping the "map" operations in the init() section (i.e., feeding mapRecordsToMetadataInput) and putting the ReduceByKey() operation in a dynamic flow, recreating the dynamic flow every 30 seconds in my case.

Something like this:

func init() {
	mapRecordsToMetadata.
		Channel(mapRecordsToMetadataInput).
		Map(mapTimeSeriesToTimeSliceFunc).
		Map(mapConvertColumnValuesFunc).
		// ... some more maps and filters
		// Removed the ReduceByKey
		AddOutput(mapRecordsToMetadataOutput)
}

func letsDoIt(streamedEvents chan []string) chan GroupedEventsByKeyChan {
	out := make(chan GroupedEventsByKeyChan)

	go func() {
		for evt := range streamedEvents {
			mapRecordsToMetadataInput <- evt
		}
	}()

	go func() {
		nextInterval := time.Now().Add(30 * time.Second)
		for {
			reduceFlow := flow.New()
			reduceInChan := make(chan EventsByKeyChan)
			reduceFlow.
				Channel(reduceInChan).
				ReduceByKey(reduceByFlowKey).
				AddOutput(out)
			go reduceFlow.Run() // the dynamic flow has to be started before it consumes input

			for evt := range mapRecordsToMetadataOutput {
				reduceInChan <- evt

				if evt.Time.After(nextInterval) {
					// flush and reduce for that interval
					close(reduceInChan)
					nextInterval = nextInterval.Add(30 * time.Second)
					break // recreate the dynamic flow for the next interval
				}
			}
		}
	}()

	return out
}

Is this the "right" canonical way to proceed? Does it scale? Or are we missing a small feature that would allow us to "flush" our static flows at fixed intervals or on demand, so that we can handle streaming use cases in a more streamlined fashion?

chrislusf (Owner) commented:

You want a window function. This is not in glow. I am working on github.com/chrislusf/gleam and adding window functions, but not ready yet either.

jdelamar (Author) commented:

Thought so. Thanks. Is my "flow.New()" approach above an acceptable workaround in the meantime?


chrislusf commented Aug 22, 2017 via email
