Perceval version 0.21.2-rc.4 returns error on parse_mbox(), perceval_parsed is empty in Social Smells Notebook #145
Comments
Hi Leilani,

Could you try running the Perceval command directly on the terminal? That is, the call

perceval_output <- system2(perceval_path,
                           args = c('mbox',mbox_uri,mbox_path,'--json-line'),
                           stdout = TRUE,
                           stderr = TRUE) # set to true

should be, on the terminal:
Where <mbox_uri> and <mbox_path> are the values of the variables, which you would paste directly into the terminal. All this line of code does is run the command on the terminal on your behalf and then ingest the JSON back into R. If you also get an empty table on the terminal, then the issue lies within Perceval itself or the data rather than Kaiaulu. This will also show us whether Perceval outputs any other error that may help us diagnose it.
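To see the terminal equivalent of that system2() call, here is a minimal sketch in base R that assembles and prints the command string. The three values are placeholders (the Perceval path is hypothetical; substitute the values from your own tools.yml and project configuration file), and the argument order simply mirrors the args vector of the call.

```r
# Placeholder values: hypothetical paths, replace with your own.
perceval_path <- "/path/to/perceval/bin/perceval"
mbox_uri <- "thrift-dev"
mbox_path <- "../rawdata/mbox/thrift-dev.mbox"

# Same arguments, in the same order, as in the system2() call.
perceval_args <- c("mbox", mbox_uri, mbox_path, "--json-line")

# shQuote() protects values that contain spaces when pasted into a shell.
terminal_command <- paste(c(shQuote(perceval_path), shQuote(perceval_args)),
                          collapse = " ")
cat(terminal_command, "\n")
```

Pasting the printed line into a terminal should run the same command that R runs on your behalf, which makes it easier to tell a Perceval problem from a Kaiaulu problem.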
Also, could you attach the thrift config file you made here so I could check?
Hi Carlos,

I seem to be getting an empty table again. I believe it's an error with Perceval, because I tried the command on different mbox files and got the same issue. This is the issue:

I've attached my thrift project configuration file. I copied the helix.yml file in Kaiaulu and made small changes for thrift.yml. In particular, I have only changed the version_control and mailing_list sections so far.

Is there a certain version of Perceval I need for Kaiaulu? The last one I tried was 0.21.2-rc.4, the newest version.

Note: The attached file is a .txt instead of .yml because .yml wasn't allowed. Also, here's a link to download the mbox I used:
My Perceval version is 0.12.24, if you would like to try it and see if that is the issue. Also try to "cd" to the folder where your thrift-dev.mbox is and run Perceval there, e.g.:

cd /Users/cvp/Desktop/sailuh/rawdata/mbox
/Users/cvp/perceval/bin/perceval mbox thrift-dev thrift-dev.mbox --json-line

I'd recommend you follow the folder structure used in Kaiaulu, just to make it easier for us to reference each other's file paths. More specifically, all Kaiaulu config files assume you have a folder structure of the following format:

/path/to/sailuh/kaiaulu

With this organization, in the config file, specify the mbox path as

../rawdata/mbox/thrift-dev.mbox

if you are running from the terminal, or

../../rawdata/mbox/thrift-dev.mbox

if you are trying to compile the vignette. Relative paths can sometimes be misleading in RStudio. If you compile the notebook, it will assume you are within /vignettes. If you execute from the terminal section, it usually assumes you are in the kaiaulu folder.

Let me know if you still can't get it to run even with my version of Perceval or after trying the above.
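The effect of the working directory on those two relative paths can be reproduced with a small base R sketch. It builds a throwaway copy of the folder layout under tempdir() (the temporary location and the empty placeholder .mbox file are assumptions made purely for illustration) and checks which relative path resolves from the kaiaulu folder and which from kaiaulu/vignettes.

```r
# Throwaway copy of the layout: sailuh/kaiaulu/vignettes and sailuh/rawdata/mbox.
root <- file.path(tempdir(), "sailuh")
dir.create(file.path(root, "kaiaulu", "vignettes"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "rawdata", "mbox"),
           recursive = TRUE, showWarnings = FALSE)
# Empty placeholder file standing in for the real archive.
invisible(file.create(file.path(root, "rawdata", "mbox", "thrift-dev.mbox")))

# Does rel_path point at an existing file when starting from working_dir?
found_from <- function(working_dir, rel_path) {
  file.exists(file.path(working_dir, rel_path))
}

kaiaulu_dir <- file.path(root, "kaiaulu")
vignettes_dir <- file.path(kaiaulu_dir, "vignettes")

from_kaiaulu <- found_from(kaiaulu_dir, "../rawdata/mbox/thrift-dev.mbox")
from_vignettes_short <- found_from(vignettes_dir, "../rawdata/mbox/thrift-dev.mbox")
from_vignettes_long <- found_from(vignettes_dir, "../../rawdata/mbox/thrift-dev.mbox")

print(c(from_kaiaulu = from_kaiaulu,
        from_vignettes_short = from_vignettes_short,
        from_vignettes_long = from_vignettes_long))
```

Running getwd() in the notebook chunk and in the RStudio terminal is a quick way to find out which of the two cases applies to you.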
Hi Carlos,
@leilani-reich Could you also open an issue on Perceval with the problem and link the issue here? It should suffice to provide them with the .mbox file URL and the command you tried to execute on the terminal. It would be helpful to know from them what is causing the issue; otherwise, we may be unable to maintain this integration with Perceval in the future. On our part, I will note the Perceval version in the README, so thank you for spotting this :)
You're welcome! Issue is posted here: chaoss/grimoirelab-perceval#810
Hi Carlos,

Aside from using Perceval version 0.12.24 for Kaiaulu, we could implement the advice from the related grimoirelab-perceval issue directly in the parse_mbox() function, so that parse_mbox() runs smoothly with a newer version of Perceval. In particular, we can change the following in parser.R (works with Perceval version 0.12.24 but not 0.21.3):

Lines 250 to 253 in f66bc0f

To this (works for the newest Perceval version, 0.21.3, but doesn't work with 0.12.24):

Is this something you think would be beneficial to change, or would it be best to just stick with Perceval version 0.12.24 for Kaiaulu? Thanks,
Hi Leilani, since the solution provided was a temporary fix, I suggest we stick to 0.12.24 until the fix is implemented on their end.
That makes sense. Thank you! I will look out for updates on this.
Hi @carlosparadis! I've followed all of the instructions related to this issue, made sure I have version 0.12.24, and tested Perceval from the vignettes/ directory, where it works just fine. The command that I used to test it out:

However, when I try to execute the .Rmd file, I got:

My setup in helix.yml is:

I'm running the .Rmd file inside RStudio. Any help is appreciated!
@lh-zhan Hi Zhan! Thank you for the interest in the tool :)

This (misleading) error that Kaiaulu generates is usually associated with the .mbox file not being found (similar to #108). It could also be that you are trying to execute on an .mbox archive that doesn't have the columns Kaiaulu tries to rename. I am assuming parse_mbox() is the function giving you trouble.

To narrow down the possibilities, could you try running the parse_mbox() function inline? In essence, execute the notebook up to, but not including, the parse_mbox() call. In another tab, open the file R/parser.R and locate this function:

Line 247 in 716a46e

Then load the parameters the function receives, and simply run the code in the function line by line until you encounter the error. For example, check if the loaded object

Otherwise, your error likely lies in this region:

Lines 261 to 273 in 716a46e

You can then check whether, for example, Kaiaulu is trying to rename columns that don't exist by typing colnames(perceval_parsed) and seeing which columns are missing from the data. Let me know if this helps narrow down the issue; otherwise we can iterate here. Thanks!
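The colnames() check can be made a little more systematic with setdiff(). The sketch below uses a made-up two-column data frame in place of the real perceval_parsed, and the list of expected columns is an assumption for illustration ("data.body" is taken from the setnames() error reported in this issue; the other names are invented), not the list Kaiaulu actually uses.

```r
# Made-up stand-in for the table produced from the Perceval output.
perceval_parsed <- data.frame(
  data.Subject = "A Message from the Board to PMC members",
  data.From = "someone@example.org",
  stringsAsFactors = FALSE
)

# Assumption: columns a rename step might expect to find.
expected_columns <- c("data.Subject", "data.From", "data.body")

# Columns the rename step would ask for that the data does not contain.
missing_columns <- setdiff(expected_columns, colnames(perceval_parsed))

if (length(missing_columns) > 0) {
  message("Columns not present in the parsed data: ",
          paste(missing_columns, collapse = ", "))
}
```

An empty missing_columns would point away from the renaming step and back toward the file path or the Perceval call.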
Thank you for your swift reply! I was able to find out that the

Paths that I've tried:

Can this issue be potentially caused by

Thanks!!
@lh-zhan Fantastic! Glad we could narrow down the problem. So our issue is narrowed down to:

Lines 254 to 257 in 716a46e

Could you try using the full path, starting from root, for

Lines 249 to 250 in 716a46e

should be handling that, but trying it manually would be interesting to see. system2 is a base R function, so what we need to figure out is why R is unable to find the file, since it apparently finds the Perceval path from tools.yml on your computer but not the file itself.

Could you also experiment with setting the stderr flag to TRUE (e.g. try stdout = FALSE and stderr = TRUE to see if it prints the terminal error)?

Lines 254 to 257 in 716a46e

That may give us further insight.
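As an alternative to toggling the two flags, system2() can capture stdout as a character vector while sending stderr to a file, so the two streams can be inspected separately. The sketch below is only an illustration: Rscript stands in for Perceval so that it runs anywhere R is installed, and the assumption that Perceval writes its progress log to stderr is just that, an assumption, although it is consistent with the log lines appearing once stderr = TRUE is set.

```r
# Stand-in for Perceval: a tiny R program that writes one line to stdout
# and one line to stderr (message() writes to stderr).
rscript <- file.path(R.home("bin"), "Rscript")
stand_in_code <- 'cat("payload line\n"); message("log line")'

# stderr goes to a temporary file instead of being mixed into the result.
stderr_file <- tempfile(fileext = ".log")

stdout_lines <- system2(rscript,
                        args = c("--vanilla", "-e", shQuote(stand_in_code)),
                        stdout = TRUE,
                        stderr = stderr_file)
stderr_lines <- readLines(stderr_file)

print(stdout_lines)
print(stderr_lines)
```

With the real command, stdout_lines would be what gets handed to the JSON parser, and stderr_lines would hold whatever the tool reported on the side.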
Hey @carlosparadis

I appreciate the instructions; I think we are one step closer! I tried using the full path from root for the mbox and ran

Line 254 in 716a46e

This line didn't throw any error. I printed everything to the console, and the output is identical to what I get when I run Perceval in my terminal. But the next line

Line 259 in 716a46e

gave me

I think we've tracked down the root of the issue: it seems like a parsing format error, but sadly I'm unfamiliar with the data.table function used here.
Would you be able to send me the output generated by jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)? E.g.:

perceval_output <- jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)
jsonlite::write_json(perceval_output, some_filepath) # write_json() takes the object first, then the file path

Maybe via a hyperlink to Google Drive or Dropbox? I want to try and replicate it on my end. From the error, it seems the problem is that data.table doesn't know what

This could be addressed by:

perceval_output <- jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)
perceval_parsed <- ....... # More careful field parse happens here

As for data.table, it is a more optimized library, implemented in C/C++, than the base R data.frame. I am curious about why you are encountering this error, however, since you are using a version of Perceval that 3 others are also using, and they have not encountered it. Hence the data request, so I can compare the files.
Sorry about the late reply, I didn't get a chance to check my emails today until now (EST :( ). I got an

I've placed the file at https://drive.google.com/file/d/1_tCDsakklY8piqwPgBG23Fb1FGjINe2P/view?usp=sharing

It does seem like the JSON object has a syntax error, as when I attempt to open it in Firefox it gives:

You can view it as Raw Data to see the entire message. Thanks!
No worries! Is this the data you have been getting all along? Because if so, the problem is not with Kaiaulu, but just the fact that the Perceval command is not executing correctly. Perceval can sometimes execute and create a file even if it can't find the data. However, if you look at the data file generated, you will see it failed:

[2023-04-09 23:22:05,955] - Sir Perceval is on his quest.
[2023-04-09 23:22:05,955] - Looking for messages from '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org' on '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox' since 1970-01-01 00:00:00+00:00
[2023-04-09 23:22:05,957] - Done. 1/1 messages fetched; 0 ignored
[2023-04-09 23:22:05,957] - Fetch process completed
[2023-04-09 23:22:05,957] - Summary of results
Total items: 1
Items produced: 1
Items skipped: 0
Last item UUID: 0a91037c482d476b4fdfd5f35d3e9c9534eebcba
Last item date: 2023-03-29 12:42:18+00:00
Min. item date: 2023-03-29 12:42:18+00:00
Max. item date: 2023-03-29 12:42:18+00:00
Min. offset: - Max. offset: - Last offset: -
[2023-04-09 23:22:05,957] - Sir Perceval completed his quest.
{"backend_name":"MBox","backend_version":"0.12.0","category":"message","classified_fields_filtered":null,"data":{"Authentication-Results":"apache.org; auth=none","Content-Transfer-Encoding":"quoted-printable","Content-Type":"text/plain; charset=\"UTF-8\"","Date":"Wed, 29 Mar 2023 08:42:18 -0400","Delivered-To":"[email protected]","From":"Rich Bowen <[email protected]>","List-Help":"<mailto:[email protected]>","List-Id":"<dev.helix.apache.org>","List-Post":"<mailto:[email protected]>","List-Unsubscribe":"<mailto:[email protected]>","MIME-Version":"1.0","Mailing-List":"contact [email protected]; run by ezmlm","Message-ID":"<[email protected]>","Organization":"The Apache Software Foundation","Precedence":"bulk","Received":"from [192.168.21.205] (unknown [52.95.4.13])\n\tby mailrelay1-he-de.apache.org (ASF Mail Server at mailrelay1-he-de.apache.org) with ESMTPSA id 8898A3EE47;\n\tWed, 29 Mar 2023 12:42:19 +0000 (UTC)","Reply-To":"[email protected]","Return-Path":"<dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org>","Subject":"A Message from the Board to PMC members","To":"\"[email protected]\" <[email protected]>","User-Agent":"Evolution 3.46.4 (3.46.4-1.fc37) ","X-Original-To":"[email protected]","body":{"plain":"Dear Apache Project Management Committee (PMC) members,\n\nThe Board wants to take just a moment of your time to communicate a few\nthings that seem to have been forgotten by a number of PMC members,\nacross the Foundation, over the past few years. Please note that this\nis being sent to all projects - yours has not been singled out.\n\nThe Project Management Committee (PMC) as a whole[1] is tasked with the\noversight, health, and sustainability of the project. The PMC members\nare responsible collectively, and individually, for ensuring that the\nproject operates in a way that is in line with ASF philosophy, and in a\nway that serves the developers and users of the project.\n\nThe PMC Chair is not the project leader, in any sense. 
It is the person\nwho files board reports and makes sure they are delivered on time. It\nis the secretary for the project, and the project\u2019s ambassador to the\nBoard of Directors. The VP title is given as an artifact of US\ncorporate law, and not because the PMC Chair has any special powers. If\nyou are treating your PMC Chair as the project lead, or granting them\nany other special powers or privileges, you need to be aware that\nthat\u2019s not the intent of the Chair role. The Chair is a PMC member peer\nwith a few extra duties.\n\nEvery PMC member has an equal voice in deliberations. Each has one\nvote. Each has veto power. Every vote weighs the same. It is not only\nyour right, but it is your obligation, to use that vote for the good of\nthe project and its users, not to appease the Chair, your employer, or\nany other voice in the project. \n\nEvery PMC member can, and should, nominate new committers, and new PMC\nmembers. This is not the sole domain of the PMC Chair. This might be\nyour most important responsibility to the project, as succession\nplanning is the path to sustainability.\n\nEvery PMC member can, and should, respond when the Board sends email to\nyour private list. You should not wait for the PMC Chair to respond.\nThe Board views the entire PMC as responsible for the project, not just\none member.\n\nEvery PMC member should be subscribed to the private@ mailing list. If\nyou are not, then you are neglecting your duty of oversight. If you no\nlonger wish to be responsible for oversight of the project, you should\nresign your PMC seat, not merely drop off of the private@ list and\nignore it. You can determine which PMC members are not subscribed to\nyour private list by looking at your PMC roster at\nhttps://whimsy.apache.org/roster/committee/ Names with an asterisk (*)\nnext to them are not subscribed to the list. 
We encourage you to take a\nmoment to contact them with this information.\n\nThank you for your attention to these matters, and thank you for\nkeeping our projects healthy.\n\nRich, for The Board of Directors\n\n[1] https://apache.org/foundation/how-it-works.html#pmc-members\n\n"},"unixfrom":"dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org Wed Mar 29 12:48:18 2023"},"origin":"/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org","perceval_version":"0.12.24","search_fields":{"item_id":"<[email protected]>"},"tag":"/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org","timestamp":1681096925.95702,"updated_on":1680093738.0,"uuid":"0a91037c482d476b4fdfd5f35d3e9c9534eebcba"}
Was this also what you got when executing Perceval via the terminal? If so, you should test again via the terminal that the parameters passed work. Otherwise, if that was not the intended file, let me know. I should have asked this in advance: what OS are you using? OS X? Also, I do not believe the above file is a JSON file, so it makes sense that any JSON parser would fail too.
Hey! I'm on macOS Monterey. Interestingly, when I run the command from the terminal, it returns a valid and parseable JSON object like the following:

Taking another look at the JSON file that was saved, all the outputs were stored upside down. By that I mean:

This summary got stored at the beginning of the file, whereas when we execute the same command in the terminal, the summary is displayed at the end. And yes, you are correct that the file is not a parseable JSON object. I think what happened was that all the data was condensed into one large paragraph, which led to unparseable JSON.
I just tested parse_mbox() on my end and I got the following parameters: perceval_path =

This is how the start of the file should look:

["{\"backend_name\":\"MBox\",\"backend_version\":\"0.12.0\",\"category\":\"message\",\"classified_fields_filtered\":null,\"data\":{\"Content-Disposition\":\"inline\",\"Content-Type

The file is about 316.1MB, and it should be usable by jsonlite::write_json() too from perceval_output.

Note that your current path differs between using RStudio knitr to compile the notebook and using the Terminal in RStudio. The project configuration file assumes your current path is in kaiaulu/vignettes/notebook.Rmd, and hence

Failing to do so will result in no files being parsed because the file path is incorrect (I believe this was the original error). The second error you are experiencing is not clear to me, since I can't reproduce it. Maybe check, when you are testing manually, that your parameters

I'd suggest you also empty your environment and only try to use the system2 call with the parameters matching what you have on the terminal. R can use variables defined outside the function scope if it doesn't find them first in the function. This can sometimes lead to a lot of confusion when diagnosing what went wrong.
Sounds like a plan! I will check later on helix.mbox to see what is not working. I take it you used the .mbox file from Codeface, right?
I didn't, I went to https://lists.apache.org/[email protected] to download mbox files from different months to test things out. Now that you mention it, I see there is a download link in thrift.yml for the thrift mbox, but I couldn't find one in helix.yml. It would be very beneficial to specify which helix mbox to download for testing purposes :)
@lh-zhan Honestly, you are better off downloading from where you pointed, as that is the only way to ensure the dataset is current. The other source I had was from supplemental material. It is good for testing things out, but it won't take you too far for actual analysis. I believe I identified the issue and pushed some code changes on #185. Would you mind verifying whether that fixes the problem for your original dataset and following up on that issue, so I can close it? I expect you will encounter the same problem in other mailing lists, as it seems the .mbox fields do vary between projects (unlike .git, which has consistent fields). It works for helix, likely because it contained the same fields as OpenSSL. The fix should hopefully make the function future-proof for any project.
Since using the older version of Perceval works for now, I am closing this issue.
Problem:
While running social_smell_showcase.Rmd on the Apache Thrift GitHub repository (https://github.com/apache/thrift), on line 139
kaiaulu/vignettes/social_smell_showcase.Rmd
Line 139 in f66bc0f
I got the following error:
Error in data.table::setnames(perceval_parsed, "data.body", "reply_body") :
Items of 'old' not found in column names: [data.body]. Consider skip_absent=TRUE.
Steps I took to try and fix it:
- Checked my perceval_path and mbox_path and made sure they were correct and relative to my working directory
- Checked my Perceval installation and tried different installation methods
- Used a different mbox file. First I tried the one provided for the Apache Thrift project by Rick. Then I used the mbox here: https://lists.apache.org/[email protected].
- Tried running the notebook on another project. In particular, I tried to run the notebook for the Apache Helix GitHub repo (https://github.com/apache/helix), but that didn't seem to make a difference.
My findings:
I looked at the source of the parse_mbox() function in parser.R, and I tried to learn more about the error.
kaiaulu/R/parser.R
Lines 243 to 281 in f66bc0f
One thing I noticed was that if I set stderr=TRUE on lines 250-253 of this function

perceval_output <- system2(perceval_path,
                           args = c('mbox',mbox_uri,mbox_path,'--json-line'),
                           stdout = TRUE,
                           stderr = TRUE) # set to true

Then I get the following error message:
Error: parse error: after array element, I expect ',' or ']'
[2023-01-26 22:19:07,492] - Sir Perce
(right here) ------^
I know the "[2023-01-26 22:19:07,492]" seems to refer to the time I ran the notebook. I believe "Sir Perce" refers to perceval, but I don't understand why perceval is giving this error.
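One possible reading of this parse error is that, with stdout = TRUE and stderr = TRUE together, system2() returns Perceval's log lines in the same character vector as the JSON lines, and the JSON parser then chokes on the first log line. A minimal base R sketch of separating the two before parsing follows. The sample vector is made up for illustration (the second timestamp and the shortened JSON record are invented), and treating every line that starts with "{" as a JSON record is an assumption about the --json-line output.

```r
# Made-up sample of mixed output: log lines plus one JSON record per line.
perceval_output <- c(
  "[2023-01-26 22:19:07,492] - Sir Perceval is on his quest.",
  "{\"backend_name\":\"MBox\",\"category\":\"message\"}",
  "[2023-01-26 22:19:07,500] - Sir Perceval completed his quest."
)

# Assumption: JSON records start with "{", log lines do not.
is_json_line <- startsWith(perceval_output, "{")

json_lines <- perceval_output[is_json_line]
log_lines <- perceval_output[!is_json_line]

# json_lines is what could then be passed on to
# jsonlite::stream_in(textConnection(json_lines)).
cat("JSON records:", length(json_lines), "\n")
cat("Log lines:", length(log_lines), "\n")
```

If json_lines comes back empty for real output, the log lines are the place to look for the reason Perceval produced no records.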
Also if I set verbose=TRUE in line 255 of this function:
line 255) perceval_parsed <- data.table(jsonlite::stream_in(textConnection(perceval_output),verbose=TRUE)) # changed verbose to TRUE
The console output says "Imported 0 records. Simplifying...".
So perceval_parsed is an empty table, which I believe explains why I was getting this error from before:
Error in data.table::setnames(perceval_parsed, "data.body", "reply_body") :
Items of 'old' not found in column names: [data.body]. Consider skip_absent=TRUE.
Referencing this documentation for the setnames() function, https://rdrr.io/rforge/CALIBERdatamanage/man/setnames.html, I see that "old" holds the existing column names to be renamed, and since the data frame was empty, the name passed as "old" wasn't found.
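Since the setnames() error is only a symptom of the empty table, a guard that stops earlier with a more direct message could make this easier to diagnose. The sketch below is a hypothetical helper, not something that exists in Kaiaulu; the function name and the wording of the message are made up, and a base data.frame stands in for the data.table.

```r
# Hypothetical helper (not part of Kaiaulu): fail early when nothing was parsed.
stop_if_no_records <- function(parsed, mbox_path) {
  if (nrow(parsed) == 0) {
    stop("Perceval returned 0 records for '", mbox_path,
         "'. Check the mbox path and the Perceval version.",
         call. = FALSE)
  }
  invisible(parsed)
}

# Example with an empty table, catching the error to show its message.
empty_table <- data.frame()
result <- tryCatch(
  stop_if_no_records(empty_table, "../rawdata/mbox/thrift-dev.mbox"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Placed right after the parsing step, a check like this would report the empty result directly instead of surfacing later as a column that cannot be renamed.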
The problem I have is I don't understand why perceval_parsed ends up being empty.
To replicate error:
- Clone Kaiaulu and install the Kaiaulu package
- Clone the Apache Thrift repo (https://github.com/apache/thrift)
- Set up the project configuration file thrift.yml following the example here (https://github.com/sailuh/kaiaulu/blob/master/conf/helix.yml). Specifically, I just set up the version control and mailing list sections
- Within social_smell_showcase.Rmd, change tools_path and conf_path (here: kaiaulu/vignettes/social_smell_showcase.Rmd, lines 58 to 59 in f66bc0f)
- Run the social smells notebook