Extract UAST from the HEAD of PGAv2 #74

Closed · 10 of 11 tasks
r0mainK opened this issue Jun 6, 2019 · 40 comments


r0mainK commented Jun 6, 2019

Context

As soon as the Infra team has copied PGAv2 to the ML cluster, we will start using it often. Most of the time, we will be extracting UASTs from the HEAD and then doing something with them. There is no reason to query Gitbase repeatedly for this, so we should do it only once.

Task

Use GitBase to extract and store, for all parsable files at HEAD, the UAST, repository name, file name and language. The storage format should be compatible with Spark so we can easily reuse it, so it should probably be Parquet. Since the UASTs are relatively heavy, we should see whether we can compress them further beforehand; check whether the LA team has any insight on this.
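
For reference, a minimal sketch of what the extraction could look like from PySpark over gitbase's MySQL wire protocol. The table layout, the LANGUAGE/UAST UDF names, the JDBC endpoint and the output path below are all assumptions to be checked against the gitbase documentation, not the actual payload used here.

```python
# Hedged sketch only: gitbase table/UDF names, endpoint and paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pga-head-uasts").getOrCreate()

# gitbase speaks the MySQL wire protocol, so it can be queried over JDBC.
head_query = """
    SELECT rf.repository_id AS repository_name,
           cf.file_path     AS file_name,
           LANGUAGE(cf.file_path, f.blob_content) AS lang,
           UAST(f.blob_content, LANGUAGE(cf.file_path, f.blob_content)) AS uast
    FROM refs rf
    NATURAL JOIN commit_files cf
    NATURAL JOIN files f
    WHERE rf.ref_name = 'HEAD'
"""

head_uasts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://gitbase:3306/gitbase")          # placeholder endpoint
    .option("dbtable", "({}) AS head_files".format(head_query))
    .option("user", "root")
    .load()
)

# Store as Parquet so the result can be reused directly from Spark.
head_uasts.write.parquet("/path/to/uasts_head_pga.parquet")       # placeholder path
```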

Checklist

  • Ask LA if they have any ideas regarding UAST compression, possibly implement a tool to compress/decompress.
  • Ask Infra to add distributed storage between Spark workers and Jupyter pods, issue
  • Ask Vadim which kind of UAST should be retrieved (native, annotated, semantic)
  • Code the payload that will do the extraction
  • Parse everything with aggressive parallelization. (Vadim)
  • Parse the OOMed sivas with very conservative parallelization. (Vadim)
  • Verify the result and find which files could not be parsed in each HEAD.
    • Create file from Parquet with siva filename, HEAD UUID, file_list (Romain)
    • Create file with list of files in HEAD of each siva (Vadim)
    • Compute success % (Romain)
  • (optional) Check the output can be used with Gemini once parquet input is added to it
@r0mainK r0mainK self-assigned this Jun 6, 2019

r0mainK commented Jun 7, 2019

So, after asking LA, it seems that they have discussed it and developed some prototypes, but they have nothing operational. Hence, we will not be compressing the UASTs, and will rely purely on Parquet for compression. They might work on it this quarter, but we shouldn't count on it in the near future.

Once the Spark cluster is usable, PGAv2 is ready, and I have created the features for the imports task, I will take care of this.

@r0mainK r0mainK changed the title [research] Extract UAST from the HEAD of PGAv2 [dataset] Extract UAST from the HEAD of PGAv2 Jun 7, 2019

r0mainK commented Jun 11, 2019

PGAv2 has been copied to the cluster, and Spark and Gitbase are usable. Currently I'm cleaning up the /user/repositories directory so that it only contains PGA; at the current speed it should take until tomorrow. I will then tackle this after the imports task - so I can assess the problems of dealing with so much data.


r0mainK commented Jun 12, 2019

After talking to Vadim about progress on this, I decided to start working on it now. I'm going to test different compression schemes via Parquet, and see whether I'm able to scale things with Spark and gitbase-spark-connector-e.

EDIT: OK, so the best scheme is no compression when writing the Parquet, then tar/gzip-ing the resulting Parquet. It achieves a compression rate almost ten times better than the per-block gzip that Parquet uses.
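
For the record, a minimal sketch of that scheme (the toy DataFrame and paths are only illustrative): write the Parquet with compression disabled, then gzip the whole directory in one pass.

```python
import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for the extracted (repo, file, lang, uast) rows.
df = spark.createDataFrame(
    [("repo1", "main.go", "Go", bytearray(b"<uast bytes>"))],
    ["repository_name", "file_name", "lang", "uast"],
)

# 1. Write the Parquet without any per-block compression.
df.write.option("compression", "none").parquet("/tmp/uasts.parquet")

# 2. Compress the resulting directory as a whole with tar/gzip, which here
#    compressed almost ten times better than Parquet's built-in per-block gzip.
subprocess.run(
    ["tar", "czf", "/tmp/uasts.parquet.tar.gz", "-C", "/tmp", "uasts.parquet"],
    check=True,
)
```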


r0mainK commented Jun 13, 2019

Okay, so this answer from Miguel and the following discussions with Maartje made it quite clear that:

  • There need to be multiple instances of GitBase, each pointing to a subset of the data, if you want to distribute queries.
  • When working with a large number of repositories, as in the case of PGA, using only one instance will not only be slow, it will simply not work. In my case, I have not been able to query Gitbase successfully without limiting the number of rows returned to a relatively low value: 100k is the current max; at 1M the server crashes.
  • While a setup with multiple Gitbase instances exists on the pipeline cluster, given the difference in size between the two clusters (32 nodes vs 4), reproducing it on ML is not going to yield much improvement.

This means that:

  • When querying Gitbase on our cluster, nothing can be distributed until the data is in one of the workers, which makes it pretty much useless, if not completely, unless we work on much smaller datasets or limit the number of rows returned.

So in order to do this task, here is the best plan I can come up with:

  1. Ask the Infra team for pipeline access
  2. Use the pipeline cluster to do this task - there may be the same issue with volume mounting as on the ML cluster (see link below)
  3. Move the parquet files from the pipeline cluster to the ML cluster, location depending on this issue
  4. Use the index to add a repository_name column to the parquet file, as currently this information is not included and some repositories are split across multiple siva files (a minimal sketch of this step follows below)
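
A sketch of step 4, assuming the extracted Parquet carries the siva filename and that the PGA index is available as a CSV mapping siva filenames to repository names (both the column names and the paths are guesses):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

uasts = spark.read.parquet("/path/to/uasts.parquet")                   # placeholder path
index = (
    spark.read.option("header", True).csv("/path/to/pga_index.csv")    # placeholder path
    .select("siva_filename", "repository_name")                        # assumed columns
)

# One repository can be split across several siva files, so join on the siva
# filename and keep every UAST row even if the index misses its siva.
with_repo = uasts.join(index, on="siva_filename", how="left")
with_repo.write.parquet("/path/to/uasts_with_repo.parquet")
```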

@m09 m09 added the dataset label Jun 18, 2019
@m09 m09 changed the title [dataset] Extract UAST from the HEAD of PGAv2 Extract UAST from the HEAD of PGAv2 Jun 18, 2019

r0mainK commented Jun 26, 2019

Had a meeting with Alex, updated the checklist accordingly.

@vmarkovtsev

> Ask Vadim which kind of UAST should be retrieved (native, annotated, semantic)

Semantic.

@vmarkovtsev

We are meeting with Máximo tomorrow to discuss the problems; he has suggested betraying gitbase in favor of a custom Go solution. Let's see.

@vmarkovtsev vmarkovtsev assigned vmarkovtsev and r0mainK and unassigned r0mainK Aug 5, 2019

vmarkovtsev commented Aug 5, 2019

I coded https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga2uast to do the task.
There are 3 stages:

  • Parse everything with aggressive parallelization. (Vadim)
  • Parse the OOMed sivas with very conservative parallelization. (Vadim)
  • Verify the result and find which files could not be parsed in each HEAD. (Romain)

@vmarkovtsev

We used the 7 nodes given by the infra, mining-5 to mining-11.

Size of the result from the first stage: 4.5TB.
Number of OOMed sivas: 211.

Current progress of the second stage: 17/211 in 3 days. This means ETA is 34 days. However, I am using a single node. Once the DockerHub mining is over I will be able to spread the load over all the 11 nodes.


r0mainK commented Aug 9, 2019

Just finished the sanity check on the Siva files that were parsed via aggressive parallelization. My workflow was the following:

  • I tried to load and read contents of a single parquet: success
  • I tried to load and read contents of a whole subdirectory (00): success
  • I tried to load and read contents of all PGA, to compute a CSV with metadata (subrepo, siva hash, repo uuid, list of files): failure

The job failed due to the presence of unreadable parquet files. The error was triggered if I read the specific parquet, or if I tried to count the rows of the DataFrame when loading from a subdir. So I went through each subdir, loading then counting the DataFrame, and if an error was caught I loaded each file in the subdir to find the corrupt ones. Once I identified all the files, I moved them to /spark_storage/pga.v2.corrupted and saved their names in /user/r0maink/sanity/corrupt_pq_1.txt.

Once that was done, I repeated the previous step, with the same result but due to a different error. This one was also caused by corrupt parquet files, but simply loading/counting did not trigger it; I had to actually use the contents, for instance by collecting the rows. So I repeated the error-finding process, then moved the files to /spark_storage/pga.v2.corrupted_2 and saved their names in /user/r0maink/sanity/corrupt_pq_2.txt.
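
For reference, a sketch of this kind of check (the path is illustrative, and it assumes the parquet directories are reachable from the driver): count each file to catch broken metadata, then materialize a few rows to catch files that only fail when the contents are read.

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
subdir = "/spark_storage/pga.v2/2e"        # example subdir, path illustrative

corrupt = []
for name in sorted(os.listdir(subdir)):
    path = os.path.join(subdir, name)
    try:
        df = spark.read.parquet(path)
        df.count()                 # fails on files with broken footers/metadata
        df.limit(10).collect()     # fails on files whose contents cannot be decoded
    except Exception:
        corrupt.append(path)

with open("corrupt_pq.txt", "w") as fh:
    fh.write("\n".join(corrupt))
```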

Once that was done, I repeated the previous step, with the same result but due to a JavaHeapSpace error at about 20-25% of the progress. As I had not optimized the query, knew Infra was going to work on the cluster today, and did not need to work on more than one subdir at a time to compute the CSV, I ran the job on each subdir independently. It just finished after ~12h (this does not really reflect the performance to be expected, as I did not try to optimize, group multiple subdirs - which would be possible - etc.).

Anyway, the CSV weighs 2.02 GB and has 216,752 lines, i.e. the non-corrupt parquet files contain that many (siva file, repo uuid, file_list) triplets. The total number of files is 32,851,320. By the way, I removed files that had a null UAST, so there might be more files in the parquet files, simply with empty UASTs.

Here are some stats about the corrupt files (as you can see, they were all concentrated in the same 7 subdirectories). Given their number, I think it's worth trying to parse those siva files once more to see whether the error was due to the process or something else (you can take the listings on the ML cluster from the locations given above).

| | # files | # non-corrupt files | # corrupt 1 | # corrupt 2 |
|---|---|---|---|---|
| all subdirs | 203,870 | 203,736 (99.93 %) | 30 (0.01 %) | 104 (0.05 %) |
| subdir 28 | 825 | 804 (97.45 %) | 4 (0.48 %) | 17 (2.06 %) |
| subdir 2a | 806 | 781 (96.90 %) | 5 (0.62 %) | 20 (2.48 %) |
| subdir 2c | 828 | 804 (97.10 %) | 3 (0.36 %) | 21 (2.54 %) |
| subdir 2d | 847 | 818 (96.58 %) | 6 (0.71 %) | 23 (2.72 %) |
| subdir 2e | 810 | 777 (95.93 %) | 10 (1.23 %) | 23 (2.84 %) |
| subdir 2f | 850 | 849 (99.88 %) | 1 (0.12 %) | 0 (0.00 %) |
| subdir 65 | 783 | 782 (99.87 %) | 1 (0.13 %) | 0 (0.00 %) |

| | size | non-corrupt size | corrupt 1 size | corrupt 2 size |
|---|---|---|---|---|
| all subdirs | 4.861 TB | 4.811 TB (98.97 %) | 6.18 GB (0.13 %) | 43.98 GB (0.9 %) |
| subdir 28 | 16.16 GB | 11.19 GB (69.25 %) | 1.51 GB (9.37 %) | 3.45 GB (21.38 %) |
| subdir 2a | 15.76 GB | 10.87 GB (68.97 %) | 1.92 GB (12.17 %) | 2.97 GB (18.86 %) |
| subdir 2c | 24.62 GB | 11.30 GB (45.91 %) | 85 MB (0.35 %) | 13.23 GB (53.74 %) |
| subdir 2d | 23.51 GB | 10.36 GB (44.06 %) | 392.95 MB (1.67 %) | 12.76 GB (54.27 %) |
| subdir 2e | 19.66 GB | 6.54 GB (33.29 %) | 1.54 GB (7.86 %) | 11.57 GB (58.85 %) |
| subdir 2f | 17.93 GB | 17.20 GB (95.97 %) | 721 MB (4.03 %) | 0 B (0.00 %) |
| subdir 65 | 15.91 GB | 15.91 GB (99.99 %) | ~0 B (0.01 %) | 0 B (0.00 %) |

@vmarkovtsev

Great report @r0mainK

This means that I need to re-process a small fraction of files which are corrupted.


r0mainK commented Aug 9, 2019

Thanks, yep the listings are in /user/r0maink/sanity/corrupt_pq_1.txt and /user/r0maink/sanity/corrupt_pq_2.txt. If you can put them in a separate directory under /spark_storage/pg1.v2.v2 or something, so I can process them directly, that would be great. As it's only ~50GB it should not take too long - and hopefully the error will not repeat itself.

@vmarkovtsev

The new files were generated and written over the corrupted ones. @r0mainK please test once again, there shall be no corruption this time.

I had to write them directly, unfortunately.


r0mainK commented Aug 10, 2019

@vmarkovtsev no problem. Anyway, I did not know this, but it turns out that when you call repartition on a DataFrame you can't use the built-in input_file_name afterwards, so the 2 columns for the subdir and siva filepath were empty in each row -_-"
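
For the record, a sketch of the workaround (paths and regexes illustrative): materialize input_file_name() into a column before any repartition, so the subdir and siva hash can still be derived afterwards.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/spark_storage/pga.v2/*/*.parquet")   # path illustrative

# Capture the source file path while it is still available, i.e. before repartition().
df = df.withColumn("source_file", F.input_file_name())
df = df.withColumn("siva", F.regexp_extract("source_file", r"([0-9a-f]+)\.parquet$", 1))
df = df.withColumn("subdir", F.regexp_extract("source_file", r"/([0-9a-f]{2})/[^/]+$", 1))

# A later repartition no longer loses the file information, since it is now a column.
df = df.repartition(200)
```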

So I launched the test once again, will post results once I have them.


r0mainK commented Aug 10, 2019

Okay, the job finished in 5h30 (the repartition really was a dumb idea, removing it halved the processing time). I checked the CSV file and this time it's good. It is slightly bigger - especially in terms of # files, as could be expected from the size of the old corrupted files:

  • 2.06 GB (instead of 2.02 GB)
  • 216,922 lines (instead of 216,752)
  • 33,315,834 files (instead of 32,851,320)

@vmarkovtsev

I have created src-d/datasets#158 to list the siva HEADs.

I launched the listing with 32 goroutines on the ML cluster, it digested 17% in 18 hours. ETA 4 days.
I will have to interrupt it on Wednesday though.

@vmarkovtsev

I parsed the OOMed sivas. I was able to process 204/211 files. The results are merged with the main dataset.

@r0mainK it is time to run the check again!

Regarding the listing, it is 80% complete. ETA Friday.


r0mainK commented Aug 21, 2019

Awesome, I've relaunched the process with the same config, let's see how it goes - I expect it to be done by tomorrow, unless something goes horribly wrong 🤞


r0mainK commented Aug 22, 2019

@vmarkovtsev extraction completed! It ran in 6 hours 8 minutes, so a bit more than last time. Unfortunately, it did not cover 100% of the files: there are currently 835 siva/parquet files from the old index missing. I cross-checked, and it seems all the missing files were from the 00 subdirectory, which contains a bit over that number of files, surprisingly.

So I tried to read and count it, and it indeed caused an error. I inspected the directory, there are 3 new files:

-rw-rw-r--. 1 1004 1004  1.8G Aug  3 09:05 00824011103c689db12451a6f73f84b57a6d05e0.parquet
-rw-rw-r--. 1 1004 1004  3.0G Aug  3 08:22 0079cc5fa5b7d13fd201fbae276b01f7f27f8dc9.parquet
-rw-rw-r--. 1 1004 1004   17G Aug  3 02:31 0067e598fa2532b9a914984456d6bff752a0cfd3.parquet

I loaded each one individually and tried to collect them, and you guessed it: the first 2 did not cause any error, it was the third, 17GB one that did, and it made the whole subdir crash. So I moved that single file to /spark_storage/the_bad_siva/ and afterwards it worked. Anyway, for the sake of comparing true run times (and since we won't have the listing until Friday in any case) I'm going to relaunch the whole process; it should be over by this evening - and I'll add the final metrics here.


vmarkovtsev commented Aug 22, 2019

> It ran in 6 hours 8 mins

I am still listing files in PGA, so the FS performance was degraded.

I have renamed /spark_storage/the_bad_siva/ to /spark_storage/bad_uasts/.

I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?


r0mainK commented Aug 22, 2019

> I didn't fully get 835. You are saying that there are 835 siva files under 00 which are in the index but are not extracted, right?

@vmarkovtsev no, what I meant was that there were 835 files missing from the new index that were already present in the previous index. This was due to one corrupt file added to the 00 subdir, which made the job on that subdir fail completely, thus making the 835 files, plus the new ones, not appear in the new index. But all of those files were extracted successfully; Spark just failed to process them due to that one bad siva.

@vmarkovtsev

The file listing is at:

/spark_storage/files_pga.v2

However, the listing has 148230 files compared to 204069 uasts. Weird. I have to re-launch the listing on the missing files.

I renamed /spark_storage/uast_pga.v2 to /spark_storage/uasts_pga.v2

@vmarkovtsev

@r0mainK The listing is finally over! 205546 files.

/spark_storage/files_pga.v2

The structure is flat, there are no subdirs.

@vmarkovtsev

I set the access for all dirs and subdirs in /spark_storage to 555. We've got

  • sivas_pga.v2 - the original PGA
  • uasts_pga.v2 - UASTs
  • files_pga.v2 - listing
  • bad_siva - temporary, to be deleted


r0mainK commented Aug 26, 2019

I have checked the processed index: the 14 subdir was not processed - the Spark jobs on it failed with a Java OOM error. I traced the origin of the problem back to a single file (14/147288108757caed09e0c65d9ec098b821129eba.parquet), which I added to the bad_uasts directory. Relaunching processing. Once it is done, I will finish up this task.


r0mainK commented Aug 26, 2019

Okay, so I finished the extraction without any errors 💯 I did find an issue with parquet file 61/614fa43723122e2a8318d65104991163b9915d72.parquet (it was empty), so I moved it to the bad_uasts folder.

As expected, the CSV file is a bit bigger (2.24 GB), and now contains 218,081 lines (UUID-siva/parquet file pairs) and a total of 36,109,756 files across all repos. This means that the added stragglers increased the file count by about 8.3 %. Also, it seems there is some duplication across PGA (most probably some files were processed twice under different UUIDs), as I found 35,991,897 distinct files over 218,023 distinct UUIDs.

I then extracted, from the theoretical listings that Vadim provided for each repo, the list of files per UUID, and did the sanity check. Although Vadim had warned that the theoretical listing was incomplete, I still found more distinct UUIDs (219,610) and a total of 40,285,913 distinct files.
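
For reference, a sketch of how the two listings are compared, treating each as a set of (repo UUID, file path) pairs (the file names and column names are assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

processed = (
    spark.read.option("header", True).csv("processed_listing.csv")    # from the parquets
    .select("repo_uuid", "file_path").distinct()
)
theoretical = (
    spark.read.option("header", True).csv("theoretical_listing.csv")  # Vadim's PGA crawl
    .select("repo_uuid", "file_path").distinct()
)

union = processed.union(theoretical).distinct().count()
intersection = processed.intersect(theoretical).count()
only_processed = processed.subtract(theoretical).count()
only_theoretical = theoretical.subtract(processed).count()
print(union, intersection, only_processed, only_theoretical)
```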

Anyways here are the results of the sanity check:

| | file count | % of union | uuid count | % of union |
|---|---|---|---|---|
| union of both listings | 40,603,063 | 100 % | 219,610 | 100 % |
| intersection of both listings | 35,674,747 | 87.86 % | 218,023 | 99.28 % |
| only in parquet listing | 317,150 | 0.78 % | 0 | 0 % |
| only in theoretical listing | 4,611,166 | 11.36 % | 1,587 | 0.72 % |

I also looked into more granular results:

| | 1st quartile | Mean | Median | 3rd quartile |
|---|---|---|---|---|
| % of extracted files per UUID | 80 % | 86 % | 92 % | 99 % |

As can be seen, although overall we extracted ~88% of files, the extraction rate varies a lot depending on the repo. As you can see on the scatter plot below, there seems to be a positive correlation between the number of files in a repo and the fraction that are extracted, but not much more. I suspect that if we looked at these rates per language, we would find that most errors come from specific drivers.

[scatter plot: fraction of extracted files vs. number of files per repo]

@vmarkovtsev

Awesome!

Is it possible to study this per language, and also to gather the repos/paths of the files which could not be extracted?


r0mainK commented Aug 26, 2019

@vmarkovtsev I was about to edit my message above :p So yeah, that's on me actually: I think I mentioned in meetings that it would be useful to have the language of each file in the parquet files, but I forgot to write it down in this issue. So currently, I could only do this using regexps on the filenames. I already have some experience doing it that way, from back when we did the apollo blog post on a similar albeit much smaller dataset, and it was pretty bad.

I think we should just rerun the listing and add this information, if possible. Getting the byte size of each file would be interesting as well, I think. If the processing is as efficient as the first time, we will miss less than 1% of files, and we can get that number further down with regexps. What do you think?

Also yes, I can create a CSV with the following schema if you want, using the CSVs I've created and the index: subdir,siva_hash,repo_uuid,repo_name,file_name

@vmarkovtsev

OK, I will edit the code and re-launch the listing tomorrow.


vmarkovtsev commented Aug 27, 2019

I launched the new listing.

@vmarkovtsev

It is funny that we've got 317150 files only in the UASTs. I hope that this time a clean run will be flawless.


r0mainK commented Aug 27, 2019

Yeah, it's surprising, especially as all of those files are in repos that were listed at least in part.


vmarkovtsev commented Aug 29, 2019

Writing this down while I remember. An important detail about how we should calculate the success rate: it must be calculated on the intersection of <sivas listed> x <uasts extracted>. In other words, we should ignore listed siva files which were not at least partially extracted.

The reason is that a failure to extract a siva is not necessarily due to Babelfish: the file can simply be too big for the given hardware and algorithm.
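
A sketch of the implied computation, with assumed listing files and column names: restrict the theoretical listing to sivas that appear in the extracted set before dividing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

theoretical = spark.read.option("header", True).csv("theoretical_listing.csv")  # assumed
processed = spark.read.option("header", True).csv("processed_listing.csv")      # assumed

# Keep only listed files whose siva was at least partially extracted.
extracted_sivas = processed.select("siva_hash").distinct()
listed_in_extracted = theoretical.join(extracted_sivas, on="siva_hash", how="inner")

# Success rate: extracted files over listed files in partially-extracted sivas.
extracted_files = listed_in_extracted.join(
    processed.select("repo_uuid", "file_path").distinct(),
    on=["repo_uuid", "file_path"],
    how="inner",
).count()
print("success rate: {:.2%}".format(extracted_files / listed_in_extracted.count()))
```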


r0mainK commented Aug 29, 2019

Ah yeah, nice point I had not considered it 💯

I just recomputed the values, defining the real union with your definition, i.e. files in sivas where at least one file was extracted. I found 38,500,270 files, which is 94.82 % of the previous union. This means that:

  • we were actually able to extract 92.66 % of files, not 87.86 % 👍
  • we failed to process ~ 5.2 % of the files, which were from 0.72 % of the repos as indicated before. We would need to look into it in more detail, but I'm guessing those files were just too big.

@vmarkovtsev

@r0mainK The new listing has finished, it is in the same place: 205283 files. As you can see, this number is lower than the previous one, and I am going crazy trying to find out why. I need to find the missing names and re-launch on them.

@vmarkovtsev

Update: I found the missing 801 files, listed them, and put them in the output directory. Now the overall count is 206084, and we should not have weird files that are present only in the uasts.

Please run the stats per language!


r0mainK commented Sep 2, 2019

@vmarkovtsev running the stats now; however, 97,049 files are still missing from the listing. They come from 402 siva files, all of which were in the listing (save for the repos). No idea why they were not listed.
Anyway, the exact list is here: missing.txt (format is subdir,siva_hash,repo_uuid,file_name)

I will update today with language stats

EDIT: okay, so this might actually be a bug in my reading of the CSV files (newlines in some of the filenames, just loving it)
EDIT2: yep, it's the newlines. If there are still any missing I will tell you.
EDIT3: OK, this is going to take me a bit more time; I will update after the retreat. Some people name their files with commas, newlines and quotation marks, which broke my CSV. I'm going to recreate the listings taking that into account.
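
For reference, a sketch of writing and reading the listings so that commas, quotes and newlines inside file paths stay within a single CSV field (using standard Spark CSV options; the toy data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy listing with a hostile file path containing a comma, a quote and a newline.
listing = spark.createDataFrame(
    [("00", "example.siva", "uuid-1", 'weird, "name"\nwith newline.py')],
    ["subdir", "siva_hash", "repo_uuid", "file_path"],
)

# Write: quote every field so embedded commas/quotes/newlines stay inside one field.
(listing.write.option("header", True)
        .option("quoteAll", True)
        .option("escape", '"')
        .csv("/tmp/listing_quoted"))

# Read back: multiLine lets a quoted field span several physical lines.
df = (spark.read.option("header", True)
                .option("multiLine", True)
                .option("escape", '"')
                .csv("/tmp/listing_quoted"))
df.show(truncate=False)
```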


r0mainK commented Sep 9, 2019

OK, so this is the final report (hopefully). In the following, I'll call extraction rate the ratio of files extracted over all files, and success rate the ratio of files extracted over files in repositories that were at least partially extracted (i.e. where Babelfish errored, not something else). I'll also call theoretical the listing Vadim created from crawling PGA, and processed the listing I extracted from the Parquet files.

preliminary

First off, here are various counts of interest for both listings. As planned, I did not find files in the processed listing that were not in the theoretical listing; however, I did find that a small number of files were duplicated in both listings (i.e. had the same UUID and file path). I do not know why that was the case; I'm guessing it's something in PGA, or in the way we crawled it.

| | # of sivas | # of repos | # of files | # of distinct files | % of duplicates |
|---|---|---|---|---|---|
| processed | 204,067 | 218,023 | 36,162,330 | 35,991,340 | 0.5 % |
| theoretical | 206,084 | 220,174 | 40,971,787 | 40,829,244 | 0.3 % |

In the following I'll be computing stats over the distinct files.

extraction and success rate

So overall, a few things to note:

  • As could be expected, the extraction rates for siva files and repos are the same
  • Babelfish caused us to miss a bit under 6% of the files in processable siva files, which accounted for 18 % of the data when measured in bytes
  • Overall we failed to extract only 12 % of the files; however, these contained about 45% of the data when measured in bytes

| | extraction rate | success rate |
|---|---|---|
| sivas | 99.02 % | 100 % |
| repos | 99.02 % | 100 % |
| files | 88.15 % | 94.26 % |
| bytes | 65.37 % | 82.12 % |

language specific analysis

I ran the same analysis per language as you asked. As you can see, results are clearly unequal. Looking first at files, we can see 3 groups appear:

  • Go, C#, Java and Ruby all have extraction rates above 96.5 %, and almost all of the errors were due to Babelfish, as can be seen by comparing with the success rate.
  • Python, PHP and Shell are all around the 90 % range (Python a bit above, the others a bit below); most errors are still due to Babelfish.
  • Finally, both C++ and JavaScript have an extraction rate of about 80 %; however, in the case of JavaScript the vast majority of failures are not due to Babelfish, given the high success rate, which is not the case for the C++ driver, which has the lowest success rate as well as extraction rate (as expected, though).

| | file count | file extraction rate | file success rate |
|---|---|---|---|
| Go | 4,126,578 | 99.88 % | 99.88 % |
| Python | 2,994,169 | 89.70 % | 91.93 % |
| C++ | 8,726,368 | 80.41 % | 86.66 % |
| C# | 2,379,754 | 98.99 % | 99.13 % |
| Java | 6,985,742 | 96.85 % | 98.69 % |
| JavaScript | 10,466,131 | 80.54 % | 97.14 % |
| Ruby | 1,143,654 | 96.70 % | 96.76 % |
| PHP | 2,888,395 | 87.64 % | 87.94 % |
| Shell | 1,118,453 | 87.54 % | 88.42 % |

Looking now at bytes, we see the same trend as before, i.e. both rates are lower, as the larger files are the ones causing problems. However, there are still some things to note:

  • for half of the languages, both rates are roughly the same. This is not true for Python, C++ and Java, where the success rates are a bit higher, and especially for JavaScript, where the success rate skyrockets by 33 %. This means there is indeed a strong correlation between file size and Babelfish errors. I think this is because the larger the file, the higher the chance it is bugged or will cause a bug - so nothing surprising, but still.
  • C++, JavaScript, PHP and Shell are the languages with the most disparity between the rates for files and bytes, i.e. the most affected by Babelfish errors and/or siva files being excluded. This is particularly true for Shell and JS, and to a lesser extent C# and PHP.

| | byte size | byte extraction rate | byte success rate |
|---|---|---|---|
| Go | 56.48 GB | 96.12 % | 96.13 % |
| Python | 22.84 GB | 84.36 % | 86.28 % |
| C++ | 22.84 GB | 63.69 % | 67.19 % |
| C# | 15.43 GB | 93.12 % | 93.32 % |
| Java | 42.19 GB | 95.26 % | 98.94 % |
| JavaScript | 227.68 GB | 50.09 % | 83.06 % |
| Ruby | 3.42 GB | 91.56 % | 91.72 % |
| PHP | 15.55 GB | 71.92 % | 72.10 % |
| Shell | 8.26 GB | 25.97 % | 26.26 % |

repo specific analysis

I plotted the same heatmap as in one of the posts above; it seems the strange distribution was due to errors in my code. This time, I found no correlation between the number of files in a given repository and the fraction of extracted files in that repository. As you can see from the histograms below, most repos were fully extracted (61 % of them), and the ratio of extracted files per repo actually follows an exponential law, further indicating that we're looking at sporadic events. I did not look per language; I'm guessing we'd find the same distribution, more or less pronounced depending on the driver.

[histograms: ratio of extracted files per repo]

conclusion

Anyway, I think I've covered more or less everything. If you want me to go into more detail, no problem. By the way, here is a zipped CSV with the subdir, siva, repo_uuid, file_path of the 2,191,222 files that apparently caused a Babelfish error (I removed newlines from the file paths).

@vmarkovtsev

@r0mainK Is it possible to add the language to that CSV file with bblfsh errors?

repo_uuid is the repo name, correct?


r0mainK commented Sep 9, 2019

@vmarkovtsev Added language and repository name. Left the PGA metadata in, just in case; anyway, the schema is now subdir, siva, repo_uuid, repo_name, lang, file_path.

The zipped CSV can be found here
