Extract UAST from the HEAD of PGAv2 #74
So, after asking LA, it seems that they have discussed it and developed some prototypes, but they have nothing operational. Hence, we will not be compressing the UASTs, and will rely purely on Parquet for compression. They might work on it this quarter, but we shouldn't count on it in the near future. Once the Spark cluster is usable, PGAv2 is ready, and I create the features for imports task, I will take care of this.
PGAv2 has been copied to the cluster, and Spark-Gitbase are usable. Currently I'm cleaning up the
After talking to Vadim about progress on this, I decided to start doing it now. Going to test out different schemes for compression via parquet, as well as see if I'm able to scale things with Spark and gitbase-spark-connector-e.
EDIT: Ok, so the best scheme is no compression when writing the parquet, then tar/gzip the resulting parquet. It achieves a compression rate almost ten times better than when using the per-block gzip that parquet uses.
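The gap between the two schemes makes sense: parquet's gzip codec compresses each block independently, so redundancy shared across blocks (such as repeated UAST node names) cannot be exploited, while a single gzip stream over the whole uncompressed file can use it. A minimal stdlib sketch of the effect, on toy data (the records and the 256-byte block size are made up for illustration, not the actual parquet layout):

```python
import zlib

# Toy payload: highly redundant records, loosely similar to serialized
# UASTs that share many node type names.
records = [b'{"type":"Identifier","name":"foo","pos":%d}' % i for i in range(1000)]
payload = b"\n".join(records)

# Per-block compression, as parquet's gzip codec does: each small block
# is compressed independently, so cross-block redundancy is wasted.
block_size = 256
blocks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
per_block = sum(len(zlib.compress(b, 9)) for b in blocks)

# One stream over the whole file, as tar/gzip of an uncompressed parquet
# does: redundancy across the entire payload is exploited.
whole = len(zlib.compress(payload, 9))

print(per_block, whole)
assert whole < per_block
```

The exact ratio depends on the data, but for repetitive payloads like UASTs the single-stream variant wins by a wide margin.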
Okay, so this answer from Miguel and the following discussions with Maartje made it quite clear that:
This means that:
So in order to do this task, here is the best plan I can come up with:
Had a meeting with Alex, updated the checklist accordingly.
Semantic.
We are meeting with Máximo tomorrow to discuss the problems; he has suggested abandoning gitbase in favor of a custom Go solution. Let's see.
I coded https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga2uast to do the task.
We used the 7 nodes given by the infra, mining-5 to mining-11. Size of the result from the first stage: 4.5TB. Current progress of the second stage: 17/211 in 3 days. This means ETA is 34 days. However, I am using a single node. Once the DockerHub mining is over I will be able to spread the load over all the 11 nodes.
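The ETA above is simple rate extrapolation; a sketch of the arithmetic, assuming a constant per-node processing rate (function name is just illustrative):

```python
def eta_days(done, total, elapsed_days):
    """Remaining days, assuming work continues at the observed rate."""
    rate = done / elapsed_days           # files processed per day so far
    return (total - done) / rate

# 17 of 211 second-stage files in 3 days on one node:
print(round(eta_days(17, 211, 3)))  # 34
```

Spreading the remaining load over 11 nodes instead of 1 would shrink this proportionally, to roughly 3 days.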
Just finished the sanity check on the Siva files that were parsed via aggressive parallelization. My workflow was the following:
The job failed due to the presence of unreadable parquet files. The error was triggered if I read the specific parquet, or if I tried to count the rows of the DataFrame when loading from a subdir. So I went through each subdir, loading then counting the DataFrame, and if an error was caught I loaded each file in the subdir to find the corrupt ones. Once I identified all the files, I moved them aside. Once that was done, I repeated the previous step, with the same result but due to a different error. This one was also caused by corrupt parquet files, but simply loading/counting did not trigger the error: I had to actually use the contents, for instance by collecting the rows. So I repeated the error-finding process, then moved those files aside as well and repeated the previous step. Anyway, the CSV weighs 2.02 GB and has 216,752 lines, i.e. that is how many entries the non-corrupt parquet files amount to. Here are some stats about the corrupt files (as you can see, they were all concentrated in the same 7 subdirectories). Given their number, I think it's worth trying to parse the Siva files once more to see if the error was due to the process or something else (you can take the listings on the ML cluster from the locations given above).
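The subdir-then-file narrowing described above can be sketched generically. Here `is_readable` is a stand-in for loading the parquet files into a DataFrame and counting/collecting it (all names below are hypothetical, not the actual job code):

```python
def find_corrupt(files_by_subdir, is_readable):
    """Return corrupt files, checking whole subdirs first.

    `is_readable(paths)` stands in for loading a batch of parquet files
    and counting/collecting the resulting DataFrame; it returns False
    if any file in the batch raises an error.
    """
    corrupt = []
    for subdir, files in files_by_subdir.items():
        if is_readable(files):           # cheap: one load per subdir
            continue
        for f in files:                  # batch failed: narrow to files
            if not is_readable([f]):
                corrupt.append(f)
    return corrupt

# Toy check: files whose name contains "bad" fail to load.
demo = {"dir0": ["a.parquet", "bad1.parquet"], "dir1": ["b.parquet"]}
readable = lambda paths: all("bad" not in p for p in paths)
print(find_corrupt(demo, readable))  # ['bad1.parquet']
```

The two-level scheme matters at this scale: one load per subdir is enough to skip the healthy directories, and per-file loads are only paid inside the 7 bad ones.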
|
Great report @r0mainK! This means that I need to re-process the small fraction of files which are corrupted.
Thanks, yep, the listings are in
The new files were generated and written over the corrupted ones. @r0mainK Please test once again, there should be no corruptions this time. I had to write them directly, unfortunately.
@vmarkovtsev no problem. So I launched the test once again, will post results once I have them.
Okay, the job finished in 5h30.
I have created src-d/datasets#158 to list the siva HEADs. I launched the listing with 32 goroutines on the ML cluster; it digested 17% in 18 hours. ETA 4 days.
I parsed the OOMed sivas and was able to process 204/211 files. The results are merged with the main dataset. @r0mainK it is time to run the check again! Regarding the listing, it is 80% complete. ETA Friday.
Awesome, I've relaunched the process with the same config, let's see how it goes - I expect it to be done by tomorrow, unless something goes horribly wrong 🤞
@vmarkovtsev extraction completed! It ran in 6 hours 8 mins, so a bit more than last time. Unfortunately, it did not cover 100% of all files: there are currently 835 siva/parquet files missing from the old index. I cross-checked, and it seems all the missing files were from the same subdir. So I tried to read and count it, and it indeed caused an error. I inspected the directory; there are 3 new files:
I loaded each one individually and tried to collect them, and you guessed it: the first 2 did not cause any error, it was the third, 17GB one that did, and it caused the whole subdir to crash. So I moved that single file aside.
I am still listing files in PGA, so the FS performance was degraded. I have renamed it. I didn't fully get the 835: you are saying that there are 835 siva files under
@vmarkovtsev no, what I meant was that there were 835 files missing in the new index that were already present in the previous index. This was due to the fact that there was one corrupt file added to that subdir.
The file listing is at:
However, the listing has 148230 files compared to 204069 uasts. Weird. I have to re-launch the listing on the missing files. I renamed it.
@r0mainK The listing is finally over! 205546 files.
The structure is flat, there are no subdirs.
I set the access for all dirs and subdirs in
I have made a check on the processed index.
Okay, so I finished the extraction, without any errors 💯 I did find there was an issue with one parquet file. As expected, the CSV file is a bit bigger (2.24 GB), and now contains 218,081 lines (UUID-Siva/Parquet file pairs), and a total of 36,109,756 files across all repos. This means that the added stragglers increased the file count by about 8.3%. Also, it seems there was some duplication across PGA (most probably some files were processed twice under different UUIDs), as I found 35,991,897 distinct files over 218,023 distinct UUIDs. I then extracted the list of files per UUID from the theoretical listings Vadim provided for each repo, and did the sanity check. Although Vadim had warned that the theoretical listings were incomplete, I still found more distinct UUIDs (219,610) and a total of 40,285,913 distinct files in them. Anyway, here are the results of the sanity check:
I also looked into more granular results:
As can be seen, although overall we extracted ~88% of files, the extraction rate varies a lot depending on the repo. As you can see on the scatter plot below, there seems to be a positive correlation between the number of files in a repo and the fraction that are extracted, but not much more. I suspect that if we looked at these rates per language, we would find most errors come from specific drivers.
Awesome! Is it possible to study per language, and also to gather the repo/paths of files which could not be extracted?
@vmarkovtsev I was about to edit my message above :p So yeah, that's on me actually: I think I had mentioned in meetings that it would be useful to have the language of each file in the parquet files, but I forgot to write it down in this issue. So currently, I could only do this using regexps on the filenames. I already had some experience doing it this way back for the Apollo blogpost, when we had a similar albeit much smaller dataset, and it was pretty bad. I think we should just rerun the listing and add this information, if it is possible? Getting the byte size of each file would be interesting as well I think. If the processing is as efficient as the first time, we will miss less than 1% of files, and we can get that number further down with regexps. What do you think? Also yes, I can create a CSV with the following schema if you want, using the CSVs I've created and the index:
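Filename-based language detection is easy to sketch but, as noted, pretty bad in practice. A toy version of the regexp approach (the extension map is deliberately tiny and hypothetical; a real classifier like src-d/enry also uses file content, shebangs, and heuristics):

```python
import re

# Tiny, incomplete extension map -- a rough stand-in, not a real classifier.
EXT_LANG = {"py": "Python", "go": "Go", "java": "Java", "js": "JavaScript"}
EXT_RE = re.compile(r"\.([A-Za-z0-9]+)$")

def guess_language(path):
    """Guess a language from the file extension, or None."""
    m = EXT_RE.search(path)
    return EXT_LANG.get(m.group(1).lower()) if m else None

print(guess_language("cmd/pga2uast/main.go"))  # Go
print(guess_language("Makefile"))              # None
```

The `Makefile` case shows why this is lossy: extensionless files, ambiguous extensions (`.h`, `.m`), and generated files all misclassify, which is why re-listing with the real language column is preferable.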
OK, I will edit the code and re-launch the listing tomorrow.
I launched the new listing.
It is funny that we've got 317150 files only in the UASTs. I hope that this time a clean run will be flawless.
Yeah, it's surprising, especially as all of those files are in repos that were listed at least in part.
Writing this down while I remember. An important detail about how we should calculate the success rate: it must be calculated on the intersection of &lt;sivas listed&gt; x &lt;uasts extracted&gt;. In other words, we should ignore listed siva files which were not at least partially extracted. The reason is that a failure to extract a siva is not necessarily due to Babelfish: the file can be just too big for the given hardware and algorithm.
Ah yeah, nice point, I had not considered it 💯 I just recomputed the values, defining the real union with your definition, i.e. files in sivas where at least one file was extracted. I found 38,500,270 files, which is 94.82% of the previous union. This means that:
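Vadim's definition can be sketched with plain sets. Both listings map a siva file to the set of file paths it contains; sivas with no extracted files at all are excluded from the denominator (toy data, hypothetical names):

```python
def success_rate(listed, extracted):
    """Fraction of listed files extracted, counted only over sivas
    that were at least partially extracted.

    Both arguments map siva file -> set of file paths.
    """
    attempted = extracted.keys()          # sivas with >= 1 extracted file
    total = sum(len(listed[s]) for s in attempted if s in listed)
    got = sum(len(extracted[s] & listed.get(s, set())) for s in attempted)
    return got / total

# Toy data: siva "s2" produced nothing, so it is excluded entirely.
listed = {"s1": {"a", "b", "c", "d"}, "s2": {"e", "f"}}
extracted = {"s1": {"a", "b", "c"}}
print(success_rate(listed, extracted))  # 0.75
```

With s2 included in the denominator the rate would drop to 0.5, which is exactly the distortion the intersection-based definition avoids.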
@r0mainK The new listing has finished, it is in the same place. 205283 files. As you see, this number is less than the previous one and I am going crazy to find out why. I need to find the missing names and re-launch on them.
Update: I found the missing 801 files, listed them, and put the results in the output directory. Now the overall count is 206084 and we should not have weird files that are present only in the uasts. Please run the stats per language!
@vmarkovtsev running the stats now, however 97,049 files are still missing from the listing. They come from 402 siva files, all of which were in the listing (save for the repos). No idea why they were not listed. I will update today with language stats.
EDIT: okay, so this might actually be a bug in my reading of the CSV files (newlines in some of the filenames, just loving it)
OK so this is the final report (hopefully). In the following, I'll be calling extracted rate the ratio of files extracted over all files, and success rate the ratio of files extracted over files in repositories at least partially extracted (i.e. where Babelfish errored, not something else). I'll also be calling theoretical the listing Vadim created from crawling PGA, and processed the listing I extracted from the Parquet files.

preliminary

First off, here are various counts of interest for both listings. As planned, I did not find files in the processed listing that were not in the theoretical listing; however, I did find that a small number of files were duplicated in both listings (i.e. they had the same UUID and file path). I do not know why that was the case; I'm guessing something in PGA, or the way we crawled it.
In the following I'll be computing stats over the distinct files.

extraction and success rate

So overall, a few things to note:
language specific analysis

I ran the same analysis per language as you asked. As you can see, the results are clearly unequal. Looking first at files, we can see 3 groups appear:
Looking now at bytes, we see the same trend as before, ie both rates are lower as the larger files are the ones causing problems. However, there are still some things to note:
repo specific analysis

I plotted the same heatmap as in one of the posts above; it seems the strange distribution was due to errors in my code. This time, I found no correlation between the number of files in a given repository and the fraction of extracted files in that repository. As you can see from the histograms below, most repos were fully extracted (61% of them), and the ratio of extracted files per repo actually follows an exponential law, further indicating that we're looking at sporadic events. I did not look per language; I'm guessing we'd find the same distribution, but more/less pronounced depending on the driver.

conclusion

Anyway, I think I've covered more or less everything. If you want me to go into more detail, no problem. By the way, here is a zipped CSV with the errors.
@r0mainK Is it possible to add the language to that CSV file with bblfsh errors?
@vmarkovtsev Added language and repository name. Left the PGA metadata in just in case; anyway, the schema is updated accordingly. The zipped CSV can be found here.
Context
As soon as the Infra team has copied PGAv2 to the ML cluster, we will start using it often. Most of the time, we will be extracting UASTs from the HEAD and then doing something with them. There is no reason to repeatedly query gitbase to do this, so we should do it only once.
Task
Use gitbase to extract and store the UAST, repository name, file name and language for all parsable files of the HEAD. The storage format should be compatible with Spark so we can easily reuse it, hence it should probably be Parquet. The UASTs being relatively heavy, we should see if we can compress them further beforehand; check whether the LA team has any insight on this.
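For reference, a gitbase query along these lines could look like the sketch below, based on gitbase's documented schema (`refs`, `commit_files`, `files` tables and the `LANGUAGE`/`UAST` functions); the exact table and function names should be checked against the deployed version:

```sql
SELECT files.repository_id,
       files.file_path,
       LANGUAGE(files.file_path, files.blob_content) AS lang,
       UAST(files.blob_content,
            LANGUAGE(files.file_path, files.blob_content)) AS uast
FROM refs
NATURAL JOIN commit_files
NATURAL JOIN files
WHERE refs.ref_name = 'HEAD';
```

The result set maps directly onto the Parquet schema requested above (repository name, file name, language, UAST).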
Checklist
(native, annotated, semantic)