API: add IWorkspace.write(Map<IFile, byte[]> ...) #1549
Conversation
CI measured a good performance gain (factor 7) for Linux as well:
Does the CI really run with so many cores?
Windows verification build 2x:
macOS verification build 3x:
Linux verification build 5x:
I'm sorry to say this, but I wonder if this feature is really worth becoming an API and being maintained as such.
If you can show how it is possible to actually do two workspace operations at once - even though they are mutually exclusive and the AliasManager is not even thread-safe - please do. You can take
I think the expectation would be that calling |
It's one of the fundamentals that each operation is mutually exclusive:
And that's appropriate, as the listeners can be sure that one operation happens after another and that they are informed in the thread of the caller. Changing that would break the API.
I think you are saying the writes can't be done in parallel on multiple threads by an external client because each thread needs its own workspace operation, and that only by using internals to write each file's content can you work around that constraint... The API seems very restrictive, focused on quite a narrow use case where one must compute all the contents as byte arrays in order to do saves in parallel. Anything that might involve streaming payloads of more significant size is just not supported and cannot in any way be supported externally. That seems a bit unfortunate.
P2 Repository API has a
In any case, the speedup might not be because of many threads writing to the same device but rather because of the batching of operations, as you have noticed that the speedup is not related to the number of CPU threads. So maybe something similar would be useful here, so that one can perform many updates (maybe in parallel threads) and the notifications are then performed only at the end (like in your parallel write API). It would then not even matter what operations are performed, whether streams or byte[] are used, and so on...
I agree that it would be cool if everything would just work in parallel, or to have a stream collector like parallelStream().collect(Workspace::collectFileContents), but as far as I know that is not possible. On the other hand, the proposed API is easy to understand, simple to implement, useful, and sufficient for the use case.
I suppose the entire design pattern for this would look pretty much the same (analogous) if in the end it called
instead of
It just feels somehow unfortunate that the more general streaming APIs are not supported in favor of a much narrower use case.
Here is an example:

@Test
public void testParallelWorkspaceOperations() throws Exception {
	IProject project = ResourcesPlugin.getWorkspace().getRoot().getProject("testParallel");
	project.create(null);
	project.open(null);
	// create 30 empty files to operate on
	List<IFile> _30files = new ArrayList<>(30);
	for (int i = 0; i < 30; i++) {
		IFile file = project.getFile("file" + i);
		file.create(new byte[0], 0, null);
		_30files.add(file);
	}
	Instant before = Instant.now();
	List<CompletableFuture<?>> futures = new ArrayList<>(30);
	long timeToProcessMs = 2000;
	// run one workspace operation per file, each on its own thread,
	// with the file itself as the scheduling rule so only that file is locked
	for (IFile file : _30files) {
		CompletableFuture<?> future = CompletableFuture.runAsync(() -> {
			try {
				file.getWorkspace().run(monitor -> {
					try {
						Thread.sleep(timeToProcessMs);
					} catch (Exception ex) {
						throw new RuntimeException(ex);
					}
				}, file, 0, null);
			} catch (Exception ex) {
				throw new RuntimeException(ex);
			}
		});
		futures.add(future);
	}
	CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
	Duration duration = Duration.between(before, Instant.now());
	System.err.println(duration.getSeconds());
	// if the operations actually ran in parallel, the total time is far below the sequential sum
	assertTrue(duration.toMillis() < timeToProcessMs * _30files.size());
}

You'll see that things perform well: the JDT could be improved to take advantage of this without a new API. A new API might still be convenient, but I don't think it's necessary for this case. Moreover, the suggested implementation uses a MultiRule of all files, which would prevent any file from being modified if another one is already referenced by the active rule. Going finer-grained with the smallest possible rule (one file) would be less blocking than creating and locking bigger rules.
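To make the scheduling-rule point concrete, here is a minimal sketch (not from the PR) contrasting one coarse MultiRule with per-file rules. It assumes a List<IFile> files, a Map<IFile, byte[]> contents, and an IWorkspace workspace are already in scope; those names are placeholders.

// coarse: one rule covering all files - while the operation runs, any
// concurrent modification of any of these files is blocked
ISchedulingRule allFiles = MultiRule.combine(files.toArray(new ISchedulingRule[0]));
workspace.run(monitor -> {
	for (IFile file : files) {
		file.setContents(new ByteArrayInputStream(contents.get(file)), IResource.FORCE, monitor);
	}
}, allFiles, IWorkspace.AVOID_UPDATE, null);

// fine-grained: one operation per file, each locking only that file, so
// unrelated files stay modifiable and the writes can run on separate threads
// (collect the futures and join them, as in the test above)
for (IFile file : files) {
	CompletableFuture.runAsync(() -> {
		try {
			workspace.run(monitor -> file.setContents(
					new ByteArrayInputStream(contents.get(file)), IResource.FORCE, monitor),
				file, IWorkspace.AVOID_UPDATE, null);
		} catch (CoreException e) {
			throw new RuntimeException(e);
		}
	});
}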
@merks ok, so you suggest to have an IWorkspace.write(Map<IFile, InputStream> ..) as well? Would be doable - but who would use that? If the client can't convert the InputStream into a byte array because of memory issues, why should they try to write multiple files at once?
@mickaelistria
Counterproductive. IFile.write does NOT forward the IO in parallel!
Thanks for trying with actual file.write.
IMO, this is where investigation should be focused. Do you know what prevents parallel IO from working here (now that we've established that begin/prepareOperation are not to blame)?
Each individual write has its own prepareOperation. Only with a batch API could that be solved. See https://github.com/eclipse-platform/eclipse.platform/pull/1549/files#diff-1fb86c69fb8845944ed1e086cf8b5b3b6ee725ae01d5bfb3ad8070cdc87e69bdR2852
OK, I see that with |
So if nobody comes up with a better alternative, I would like to continue with this.
I certainly don't want to block efforts to improve performance, so take what I say with a grain of salt... It seems to me the primary way to update the content of an IFile has been via InputStream, and that adding API to support parallel "setContent" would primarily be focused around streaming APIs, because of course one can always use ByteArrayInputStream to make use of that. I understand that this might well be less efficient than using byte[] directly, but to achieve parallelism only a stream-based approach is actually needed, so the byte[]-only API seems odd from an API point of view. I can certainly imagine generators that could produce a stream of content without ever having the entire content in memory, and having that content for all files be written in parallel. A copy or import operation of an entire project might be such an example. Of course there is also the argument "why provide API when maybe no one will use it?" That's my 2 cents' worth...
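As an illustration of the ByteArrayInputStream point, a small sketch: a hypothetical streaming overload (the write(Map<IFile, InputStream>, ...) discussed above, which is not part of this PR) could be fed both by byte[] callers and by sources that never hold the whole content in memory. The overload name and the byteContents/targetFile/sourceFile variables are assumptions.

Map<IFile, InputStream> streams = new HashMap<>();
// byte[] callers can always wrap their arrays in a ByteArrayInputStream
for (Map.Entry<IFile, byte[]> entry : byteContents.entrySet()) {
	streams.put(entry.getKey(), new ByteArrayInputStream(entry.getValue()));
}
// a generator-style source, e.g. copying another file's contents, never
// materializes the full payload in memory (IFile.getContents() returns a stream)
streams.put(targetFile, sourceFile.getContents());
// workspace.write(streams, IResource.FORCE, null); // hypothetical overload, not existing API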
Do you miss streaming of the IFiles (Stream) or streaming of the content (InputStream) - or both? It would be simple to provide an API.
I feel a bit sheepish because I know you understand this stuff much more deeply than I do and have invested way more effort than I have with my superficial observations... Mostly my observation is that these are the ways to update the contents: And that the new API supports only parallelizing the byte[] versions but not the InputStream versions. (I've never noticed the IFileState versions before.) I think supporting InputStream would be sensible and somehow more uniform, and as you say, it looks very similar to what you already did. I don't know how others feel about that... In any case, I will not block whatever decision you make. And thanks for all the effort to make Eclipse faster!
I would like to get feedback on whether resources should have its own ExecutorService for the parallelism or whether it should be passed as a parameter. And if resources creates an ExecutorService on its own, whether it should be kept static or created per request. Pros for passing it as a parameter:
Cons for having a static ExecutorService:
Everyone is so busy. With respect to executors, there are indeed pros and cons, of course. If one needs to pass it in, then each framework that uses this will create its own executor, and one might end up with many, which would be kind of wasteful. Creating one on each request is of course additional overhead and potentially costly per call, but no long-term waste. I didn't even think of the priority thing. I don't have a good sense of what's best. 😞
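To make the trade-off concrete, a rough sketch of the two directions being weighed; both signatures are hypothetical and not part of this PR.

// Variant A - the caller passes the executor: the caller controls pool size,
// thread priority and lifecycle, but every framework may end up owning its own pool.
// void write(Map<IFile, byte[]> contents, int updateFlags, IProgressMonitor monitor,
//		ExecutorService executor) throws CoreException;

// Variant B - resources owns the executor. A per-request pool avoids keeping
// idle threads alive between batches but pays the creation cost on every call:
ExecutorService perRequest = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
try {
	// submit one IO task per file here
} finally {
	perRequest.shutdown();
}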
to create multiple IFiles in a batch.

For example, during a clean build JDT first deletes all output folders and then writes one .class file after the other. Typically many files are written sequentially. However, they could be written in parallel if there were an API.

This change keeps all changes to the workspace single-threaded but forwards the IO of creating multiple files to multiple threads. The single most important use case would be JDT's AbstractImageBuilder.writeClassFileContents().

The speedup on Windows is roughly the number of cores when they have hyper-threading.

OutOfMemory is not to be feared, as the caller has full control over how many bytes are passed.

Discussion welcome.
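For readers skimming the thread, a rough sketch of how a builder-like client might call the proposed batch write. The exact parameter list is an assumption based on the title (IWorkspace.write(Map<IFile, byte[]> ...)), and CompiledClass, compiledClasses and outputFolder are placeholder names, not JDT API.

// collect all class-file bytes first, then hand them to the workspace in one batch
Map<IFile, byte[]> classFiles = new LinkedHashMap<>();
for (CompiledClass compiled : compiledClasses) {
	IFile target = outputFolder.getFile(compiled.fileName());
	classFiles.put(target, compiled.bytes());
}
// single batched call: workspace changes stay single-threaded,
// only the file IO is forwarded to multiple threads
// (assumed signature - flags/monitor parameters may differ in the final API)
workspace.write(classFiles, IResource.FORCE, null);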