Discussion: future of Scribe-Data as a package #26

Closed
2 tasks done
wkyoshida opened this issue Nov 17, 2022 · 22 comments

@wkyoshida
Member

Terms

Issue

Opening this issue to continue the discussion from here and here. I thought it could be good to move it to a separate location where it could continue even after the linked PR gets merged.

I'm thinking this issue can be used to generally discuss what Scribe-Data could provide to a greater community as a package. Some related topics could include:

  • What about converting Scribe-Data into a language pack-generation tool available as a package?
  • What about simply making the language packs themselves as the packages instead?
  • Where to make the packages available? Currently on PyPI
  • Possibly renaming the packages from scribe-data to something more descriptive of their purpose?
  • What changes would be needed to make Scribe-Data less tightly coupled to the other Scribe repos?
  • Whatever becomes of Scribe-Data, how to integrate it into the rest of Scribe to continue serving Scribe's data needs?

Some thoughts were already made in the linked PR, but we can resume here. I agree that Scribe-Data could potentially become something useful for a larger audience 🧑‍🤝‍🧑 🙌 Continuing this discussion could be good to see what that could be.

@andrewtavis andrewtavis added the question Further information is requested label Nov 21, 2022
@andrewtavis
Member

Hi @wkyoshida :) Firstly, apologies for my relative absence the last couple days. Lots going on between getting healthy, work and the rest of life... 😊

This issue will definitely help, thanks for it! I had discussions enabled on Scribe repos before, but I feel that it only divides the discussion until we actually have more conversations happening. As far as the next two points are concerned:

  • What about converting Scribe-Data into a language pack-generation tool available as a package?
  • What about simply making the language packs themselves as the packages instead?

Would we be able to do both? Say that we have up-to-date versions of the packs available for direct download, and the package also has Python code that generates the packs? We could keep the current structure of extract_transform and load in this case?

  • Whatever becomes of Scribe-Data, how to integrate it into the rest of Scribe to continue serving Scribe's data needs?

To me it makes sense that we make downloading the packs general and adapt the way that language packs are downloaded within the app. Should be doable 😊

@andrewtavis
Member

Here's the LanguagePacks logo I made, btw 😉

Language Packs Logo

See NumPy for the inspiration (not sure if you know their logo 🙃). I think we could do the same for LanguageData as well, if that sounds better :)

@andrewtavis
Member

andrewtavis commented Nov 23, 2022

Further thoughts on this:

  • Within load we could have a packs or data directory where the formatted outputs are saved
  • Within this directory we’d want directories for languages, within them directories for word types, and then JSONs that are for example verbs.json and maybe verbs_rdb.json
    • Purpose of the rdb files is to save it in a format that’s easy for loading into relational databases
    • We’ve discussed that the outputs for Scribe would ideally be changed so that, for example, infinitive is a key in the JSON whose value is a list that can be loaded directly as a column (same for all the conjugations, noun genders, etc.), rather than loading everything row by row from infinitive-conjugation pairings and the like
    • With that being said, I think the original structure should also be saved as some other projects might have use for it 😊
  • From there we change the loading script in Python such that we have maybe an _update_packs.py and a load_packs.py
    • The former updates the local files in the package and the latter is used to access them via the GitHub API and save them somewhere
    • We then put a simple Python file in Scribe-iOS that for now we still need to run manually after an update that would import the package and load the desired data from GitHub
    • A future step then would be automating the update step via GitHub actions (or switching to Toolforge, but I’m considering that GitHub might be best so that we get more visibility/hopefully help)
    • We could then talk manual/regular updates to iOS outside of updates or whatever else 🚀
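A rough sketch of how the two output shapes discussed above could relate, assuming a row-oriented structure keyed by infinitive as the "original" format. The sample verbs, the form names (presFPS, presSPS) and the to_columns helper are hypothetical and only illustrate the idea of a verbs_rdb.json whose keys can be loaded as columns:

```python
import json

# Hypothetical row-oriented verb data: one entry per infinitive.
verbs = {
    "gehen": {"presFPS": "gehe", "presSPS": "gehst"},
    "sehen": {"presFPS": "sehe", "presSPS": "siehst"},
}


def to_columns(rows: dict) -> dict:
    """Convert {infinitive: {form: value}} into {column_name: [values]}."""
    # Collect every form name once, keeping first-seen order.
    forms = []
    for conjugations in rows.values():
        for form in conjugations:
            if form not in forms:
                forms.append(form)

    columns = {"infinitive": list(rows)}
    for form in forms:
        # Missing forms become None so that all columns keep the same length.
        columns[form] = [rows[infinitive].get(form) for infinitive in rows]
    return columns


verbs_rdb = to_columns(verbs)
print(json.dumps(verbs_rdb, ensure_ascii=False, indent=2))
```

Each list in the result lines up index by index with the infinitive list, so every key could be inserted as one table column.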

@wkyoshida
Member Author

I'll later add some thoughts to the above as well, but real quick, regarding the discussion here:

Adding a popularity score into the output would be enough so that we can order it on the Swift side, as I figure ordering the results here would leave us open to it not being maintained when it’s added to the DB.

My initial inclinations were to store the emoji data with the below format:

    {
        "emoji": "👀",
        "is_base": false,
        "rank": 38
    },

My thinking was that this could make more sense than storing rank and is_base in a separate JSON, which would be a different approach. That approach with a separate JSON, with the addition of SQLite, could simply correspond to a separate DB table, and the Scribe keyboards could follow the workflow of:

  1. Determine which emojis could be auto-suggestions based on user input
  2. Quick look-up these auto-suggestion emojis in the separate JSON/table
  3. Do additional processing based on what is retrieved for their is_base and rank values

The initial inclination (and JSON format today) made more sense before as I was imagining that, for each auto-suggestion trigger keyword, Scribe-Data would already do some pre-processing to only save the (for instance) 3 most popular emojis for that trigger keyword OR to already order the possible emoji suggestions by rank.

  • ❌ I'm thinking a potential issue with not doing this pre-processing in Scribe-Data could occur with trigger keywords that have a multitude of possible emojis. I mostly think it would be best not to leave this processing to the app side, and instead to make the determination of which emoji to auto-suggest as easy as possible there.
  • ✔️ However, with the new idea to make Scribe-Data a package for more general use, I'm thinking that this format seems a bit better for that purpose than separate JSONs, since a user's case won't necessarily involve the specific SQLite needs/implementation that Scribe has.
  • 🟡 Saving only the 3 most popular emojis may not be ideal for a general-use package. However, perhaps ordering could still be useful for that purpose.
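As a sketch of the pre-processing options weighed above, the snippet below orders the candidate emojis for each trigger keyword by rank and optionally keeps only the top n. It assumes that a lower rank means a more popular emoji; the sample data, the top_emojis function and its emojis_per_keyword parameter are hypothetical names for illustration, not the package's API:

```python
from typing import Optional

# Hypothetical input: each trigger keyword maps to candidate emojis in the
# record format shown above (emoji, is_base, rank).
emoji_keywords = {
    "look": [
        {"emoji": "👀", "is_base": False, "rank": 38},
        {"emoji": "🔍", "is_base": False, "rank": 120},
        {"emoji": "👁️", "is_base": False, "rank": 310},
        {"emoji": "🧐", "is_base": False, "rank": 95},
    ],
}


def top_emojis(keywords: dict, emojis_per_keyword: Optional[int] = None) -> dict:
    """Order each keyword's emojis by rank and optionally keep only the top n.

    Assumption: a lower rank value means a more popular emoji.
    """
    result = {}
    for keyword, candidates in keywords.items():
        ordered = sorted(candidates, key=lambda record: record["rank"])
        # A slice end of None keeps the whole ordered list.
        result[keyword] = ordered[:emojis_per_keyword]
    return result


# Ordering only, which may suit a general-use package.
print(top_emojis(emoji_keywords))
# Ordering plus keeping the 3 most popular, closer to what the keyboards need.
print(top_emojis(emoji_keywords, emojis_per_keyword=3))
```

Leaving the limit optional would let the generation side stay open while an app still saves only what it needs.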

With the separate JSON/table idea:

  • ✔️ The issue with doing the processing on the app-side could be lessened, since the app-side workflow could instead simply make use of quick look-ups in the separate JSON/table for the emoji's rank and is_base.
  • ❌ As already mentioned, the format of separate JSONs could perhaps not be ideal if Scribe-Data is a general-use package.
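For comparison, the separate-table idea could look roughly like the following, using Python's built-in sqlite3 module. The table name emoji_info, the column names and the lookup helper are assumptions made for this sketch, and lower emoji_rank values are assumed to mean more popular:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical lookup table holding what would otherwise be a separate JSON.
conn.execute(
    "CREATE TABLE emoji_info ("
    "emoji TEXT PRIMARY KEY, is_base INTEGER NOT NULL, emoji_rank INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO emoji_info VALUES (?, ?, ?)",
    [("👀", 0, 38), ("🧐", 0, 95), ("🔍", 0, 120)],
)


def lookup(connection: sqlite3.Connection, candidates: list) -> list:
    """Return (emoji, is_base, emoji_rank) for the candidates, most popular first."""
    if not candidates:
        return []
    # One "?" placeholder per candidate keeps the query parameterized.
    placeholders = ",".join("?" for _ in candidates)
    query = (
        "SELECT emoji, is_base, emoji_rank FROM emoji_info "
        f"WHERE emoji IN ({placeholders}) ORDER BY emoji_rank"
    )
    return connection.execute(query, candidates).fetchall()


# Steps 2 and 3 of the workflow: look up the candidates, then process them.
print(lookup(conn, ["🔍", "👀"]))
```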

Mostly just adding some thoughts here regarding the format of the emoji suggestion data, since the work completed for the feature was mostly done before the idea of Scribe-Data as a general-use package was around. Wondering now if the general-use package idea could influence anything related to how Scribe-Data processes/stores the emoji suggestion data.

@andrewtavis
Member

I'm kind of mixed between some of these points. I'd say for making it work for many people, we'd definitely not want to just save the top 3, but then this is being covered by #28 where they can choose how many they'd like :) So on the generation side definitely keep it open.

With that being said, what Scribe-iOS saves can totally be only what it needs so long as we make sure that the order of the outputs is maintained correctly, which should be very doable. So if we want to just get rid of rank in the output somewhere and save them how we'd use them, then let's go for it 😊 As long as what we're using is directly generated by the package, all fine by me :)

Maybe an option is to do this in a similar way to verbs where we'd just have keys that are emojis, is_base and rank, with the values then being loaded as columns. It might be simple for us to send that over, load it into a local DB, and then delete it so that we don't need to load it into memory all the time? But then I'm not sure if we can just set up an SQLite table that sits on their device. It's likely that we'll still have all of this sitting as JSONs on their device that then get loaded when the keyboards are used.

@andrewtavis
Member

andrewtavis commented Nov 30, 2022

Hey @wkyoshida! Here are the points from our discussion yesterday with some updates:

  • I checked with Mahir who's the main lexicographical data admin and he supports us making Scribe-Data more general and accessible to wider use cases
  • We talked about what the long term goal for Scribe-Data is:
    • We'd have information, lexeme ids and a last updated timestamp to allow us to check what to send to the user
    • The interim solution is downloading something into the app, loading it into SQLite, and then deleting the original data
    • The general thought on the structure is Wikidata -> Scribe-Data -> Toolforge/Wikibase -> Scribe-Server -> Scribe apps
      • Scribe-Data would be the package that would extract and format the data in a general way (ETL)
      • Toolforge could be used to load the package and save the packs
      • We need to check if using Wikibase for these kinds of packs would be acceptable
        • I'm thinking that saving a copy of the data might actually be frowned upon
        • Maybe we can do a work around and save update information, thus making the update queries faster? (filter only for the lexemes that we know they need)
      • Scribe-Server would load it into the apps and be Scribe specific (ELT)
  • I'll be making the projects as we discussed and linking those to the readmes

Let me know if I forgot anything above! Great talking to you! 😊

@andrewtavis
Member

@wkyoshida, let me know what you think of the current projects as well as how the road map is looking with them listed rather than the versions and issues :)

@wkyoshida
Member Author

  • ... JSONs that are for example verbs.json and maybe verbs_rdb.json
    • Purpose of the rdb files is to save it in a format that’s easy for loading into relational databases

I'm thinking this could likely be implemented as an option 👍 So, users would be able to specify outputs in both formats or either format.

I'd say for making it work for many people, we'd definitely not want to just save the top 3, but then this is being covered by #28 where they can choose how many they'd like

As we discussed in our call, the functionality to "save the top 3" and what is covered in #28 refer, in fact, to separate output customization ideas. The "save the top 3" idea, though, could definitely be implemented as another option to specify how many top 'n' popular emojis to save per trigger keyword 👍

We need to check if using Wikibase for these kinds of packs would be acceptable

The way I'm thinking the data could be saved is with a database in Scribe-Server [1] itself (this appears that it could be a possibility with Toolforge databases). So, I'm thinking perhaps that Scribe-Server could do something like:

  • At a determined frequency, use the Scribe-Data package to generate the packs
  • Process/load the new data from the packs into the Scribe-Server DB
  • Store the packs following guidance from Hosting large files. Scribe would likely be able to keep files under the 1GB limit, even breaking apart the packs by function to limit the size of individual files.
  • Delete whatever older, previous generations of the packs were in storage before

Then, when a user requests new data for an existing keyboard:

  • Only send the newer data that they actually need, perhaps by filtering the DB query with a last_updated parameter

When a user requests data for a new keyboard:

  • Instead serve the stored pack (mentioned above in 'Hosting large files'), as it would be the complete, most recent data
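The two request cases above could be sketched as follows. This is only an illustration with sqlite3 standing in for the Scribe-Server DB; the nouns table, its columns, the pack path and the function names are all hypothetical. Timestamps are assumed to be ISO 8601 strings in a single format and time zone, so plain string comparison orders them correctly:

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
# Hypothetical server-side table with a lexeme id and a last updated timestamp.
conn.execute(
    "CREATE TABLE nouns (lexeme_id TEXT PRIMARY KEY, noun TEXT, last_updated TEXT)"
)
conn.executemany(
    "INSERT INTO nouns VALUES (?, ?, ?)",
    [
        ("L1", "Haus", "2023-01-10T00:00:00Z"),
        ("L2", "Baum", "2023-03-02T00:00:00Z"),
        ("L3", "Buch", "2023-04-15T00:00:00Z"),
    ],
)


def rows_updated_since(connection: sqlite3.Connection, last_updated: str) -> list:
    """Return only the rows that changed after the client's last update."""
    return connection.execute(
        "SELECT lexeme_id, noun, last_updated FROM nouns "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_updated,),
    ).fetchall()


def data_for_request(
    connection: sqlite3.Connection, last_updated: Optional[str] = None
) -> dict:
    """Serve the full stored pack to new keyboards, a delta to existing ones."""
    if last_updated is None:
        # New keyboard: point to the complete, most recent pack (hypothetical path).
        return {"type": "full_pack", "path": "packs/de/nouns.json"}
    return {"type": "delta", "rows": rows_updated_since(connection, last_updated)}


print(data_for_request(conn))
print(data_for_request(conn, last_updated="2023-02-01T00:00:00Z"))
```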

@wkyoshida, let me know what you think of the current projects as well as how the road map is looking with them listed rather than the versions and issues :)

Projects are up 🙌 Nice!! Thanks for looking into them

A quick suggestion for projects - I was thinking actually of even having something of a 'main' project that has all the work represented in it. Was thinking of using something like a priority field to rank the importance of issues/work; that way, potentially eliminating the need to keep track of this in the README.md as it is now. Any thoughts on this?

Also, quick other note, it seems that when accessing Projects, GitHub by default takes you to your Recently viewed projects specifically, which I'm not a huge fan of really. I'm thinking that linking to the full list of Projects might make more sense. This can be done by linking to the below instead:
https://github.com/orgs/scribe-org/projects?query=is%3Aopen

Great talking to you! 😊

Likewise, good sir!!


[1] Referring to Toolforge/Wikibase as well here, since that is perhaps where Scribe-Server will be hosted.

@andrewtavis
Member

I'm thinking this could likely be implemented as an option 👍 So, users would be able to specify outputs in both formats or either format.

Makes total sense to me! 😊

The "save the top 3" idea, though, could definitely be implemented as another option to specify how many top 'n' popular emojis to save per trigger keyword 👍

Ah yes, I remember now :) num_emojis is for how many from the top we're getting, and then we can do an emojis_per_word or something like that for the one I was thinking about :)

A quick suggestion for projects - I was thinking actually of even having something of a 'main' project that has all the work represented in it. Was thinking of using something like a priority field to rank the importance of issues/work; that way, potentially eliminating the need to keep track of this in the README.md as it is now. Any thoughts on this?

Makes sense to me 😊 Will edit them now to get that in there!

Also, quick other note, it seems that when accessing Projects, GitHub by default takes you to your Recently viewed projects specifically, which I'm not a huge fan of really.

Was annoying me as well! 😄 Will change the links :)

@andrewtavis
Member

@wkyoshida, how does the new road map section read? I think that the idea of having a main project and branch projects will be sensible to folks, and will allow us to add a bit of context to the main project board so we know if some issues are related. Sadly you can't add links to a drop down so that we can link directly to the branch ones, but maybe some day 😇

Lemme know what you think!

@andrewtavis
Member

Or ya know, I think that the branch project idea can be fixed by simply using the filters, so I'll delete the other projects and just make Sub Project filters.

@andrewtavis
Member

I think it reads much better now and will be much more useful going forward. Let me know if you have some suggestions for more columns 😊

@wkyoshida
Member Author

Or ya know, I think that the branch project idea can be fixed by simply using the filters, so I'll delete the other projects and just make Sub Project filters.

Was going to suggest this 😆 I think it's looking great, @andrewtavis!

Some other ideas could also be:

  • Using numbers for the priority field instead, e.g. highest could be 1. More thinking that this alternative could provide more granularity when it comes to ordering the issues. With Low, Medium, and High, there's really only three priorities, so kinda have to clump some issues together. Numbers could show more defined ordering if needed (doesn't have to be so every issue has a unique number, but the option for granular ordering would be there)
  • The Group by values option. Could be, for instance, having the issues grouped by their Priority or the Sub Project.

The above are more just other ideas I had, but I wasn't thinking they were needed per se, more so if they end up making sense potentially in the Projects view.

@andrewtavis
Member

andrewtavis commented Dec 7, 2022

Are you able to group by values yourself, @wkyoshida? I think that the baseline view being a group by of Priority is a good idea, but am just wondering if folks can do it themselves and just not have the option to save. Switched it over to numbers with the colored circle emojis so it's clear that 1 is the highest with green, and then using yellow, orange, red and black for the rest :)

Looks much better, thanks for the suggestions!

@wkyoshida
Member Author

Are you able to group by values yourself, @wkyoshida? I think that the baseline view being a group by of Priority is a good idea, but am just wondering if folks can do it themselves and just not have the option to save.

Yup! That's the behavior I'm seeing. Can modify filters/groupings on my own view if desired, but doesn't save. I think it's looking good! 🙌

1 is the highest with green, and then using yellow, orange, red and black for the rest :)

If another color is needed, thinking that 🟣 could be an option in between 🔴 and ⚫
🤷 perhaps..

@andrewtavis
Member

We can definitely do 🟣 if we need another one 😊

Sorry for not getting to things! I've been having a whirlwind of a last few weeks, but there's some amazing news for me (and Scribe for that matter 😊). Will let you know on a future call 🙃

Will get to the PR in the coming days! 🚀

@wkyoshida
Member Author

but there's some amazing news for me (and Scribe for that matter 😊). Will let you know on a future call 🙃

How exciting! Look forward to hearing about it!

@andrewtavis
Member

andrewtavis commented Apr 8, 2023

Putting a random thought I had in here as well 😊

Something that I've been considering is how to link downloaded custom keyboard data to a custom keyboard extension within Scribe-iOS. As of now I'm not sure how to download an extension into an app. There should be a way, but the big thing is then that we need to make it accessible within their phone's keyboard settings 🤷‍♂️ This might be hard/annoying 🤔

What we could however do is create a baseline keyboard view that's loaded by keyboards that do not have data files. So the user would go into their settings and select the keyboard they want. Ideally they'd then or have already downloaded the Scribe-Data pack, but if not then they could still switch to the keyboard and would then be prompted with a keyboard sized field that says "Hey you need to download the data for this keyboard — press the big button!" We could then add some basic UI in there to show the download length, and once the data is downloaded the interface would update with the keyboard :)

@wkyoshida
Member Author

Sorry, I've been trying to understand what you're getting at here @andrewtavis

So, the core issue, if I'm understanding, is that:

  1. Today, there isn't an issue when a user selects a Scribe keyboard from the iOS keyboard settings, since the data for the Scribe keyboards all come downloaded already when a user installs Scribe.
  2. In the future, however, we may have an issue when we decouple the data from the Scribe app and make it possible to download the data instead. This is because we still want to allow the user to select the Scribe keyboard from the iOS keyboard settings, even if they have not yet downloaded the data pack for it.

Did I get it?

If I did, then yeah! I think that your idea of using a baseline keyboard makes sense. I'm thinking that a keyboard with the correct keys layout (QWERTY, AZERTY, etc.) could still load and still function as a regular keyboard. The difference could be that the Scribe command bar simply doesn't work. The prompt to go download the data could go where the command bar would go.

@andrewtavis
Member

In the future, however, we may have an issue when we decouple the data from the Scribe app and make it possible to download the data instead. This is because we still want to allow the user to select the Scribe keyboard from the iOS keyboard settings, even if they have not yet downloaded the data pack for it.

Generally yes :) I just don't know how to also download that menu option into their settings (for example, to select Deutsch within Scribe in their keyboard settings), so that we'd only give them the option to install as input methods those keyboards that they do have data for. The option will continue to be there, and without the data the keyboard won't have the features.

I'm thinking that a keyboard with the correct keys layout (QWERTY, AZERTY, etc.) could still load and still function as a regular keyboard. The difference could be that the Scribe command bar simply doesn't work. The prompt to go download the data could go where the command bar would go.

This is great 😊 This is exactly what we should do! Thanks, @wkyoshida 🙌

@andrewtavis andrewtavis moved this from Todo to In Progress in Scribe Board Oct 30, 2023
@andrewtavis
Member

andrewtavis commented Oct 30, 2023

@m-charlton: I wanted to bring this issue to your attention as well :) I think that this discussion in general will give you a very good understanding of what we're planning on the Scribe-Data side while also including some bits that link it to iOS and Server (some points - read "my random logo tangents" - are definitely skimmable though 😅). Having all three of us in here will also help with the planning of some future issues in line with us transitioning this software from something that I run locally to more of an all purpose Wikidata language data extraction tool (something the Wikidata Lexemes community has welcomed).

Initially I think that #47 is something to take a look at generally, as Scribe-Data is in much better shape than it has been as far as its processes go, but the formatting scripts are a mess... The issue is that there are line limits for SPARQL queries on Wikidata, so we can't just write whatever SPARQL we want, but have to start from relatively raw data. Going through and doing all of these conditional checks via Python leads to code that's not fun to maintain, whereas leveraging the structure of SQL by loading the results and "querying" them to format them in Scribe-Data could be much cleaner.
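To illustrate the kind of formatting step described here, the following sketch loads hypothetical raw query results into an in-memory SQLite table and pivots them with one SQL statement instead of Python conditionals. The table layout, form labels and sample rows are assumptions for illustration, not the actual query output or formatting scripts:

```python
import sqlite3

# Hypothetical raw results, one row per (lexeme, form) as a simple query
# might return them: (lexeme_id, infinitive, form_label, form_value).
raw_rows = [
    ("L1", "gehen", "presFPS", "gehe"),
    ("L1", "gehen", "presSPS", "gehst"),
    ("L2", "sehen", "presFPS", "sehe"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_forms ("
    "lexeme_id TEXT, infinitive TEXT, form_label TEXT, form_value TEXT)"
)
conn.executemany("INSERT INTO raw_forms VALUES (?, ?, ?, ?)", raw_rows)

# Pivot: one output row per lexeme, one column per form. MAX ignores NULLs,
# so each column holds the matching form, or NULL if the form is missing.
formatted = conn.execute(
    """
    SELECT infinitive,
           MAX(CASE WHEN form_label = 'presFPS' THEN form_value END) AS presFPS,
           MAX(CASE WHEN form_label = 'presSPS' THEN form_value END) AS presSPS
    FROM raw_forms
    GROUP BY lexeme_id, infinitive
    ORDER BY infinitive
    """
).fetchall()

for row in formatted:
    print(row)
```

Missing forms come back as None here, which replaces the per-field checks that would otherwise be written by hand.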

I'd be happy to do an exploratory query to replace one formatting process as a proof of concept, and then we could break it up and make some issues if it all works out?

If we can get this figured out I think the overall code base will be much nicer to work with and from there we can work on expanding out the current queries to include new languages as well. The beauty of it then is that we're to a non-trivial degree decoupling Scribe-Data/Server from Scribe-iOS where the latter is just a client of the former process 😊

@andrewtavis
Member

Looking through things, @wkyoshida, I think that this issue has served us well and we're ready to close it up. Future issues can be made into individual ones, but generally what's in here is covered by GSoC '24 and Scribe-Server 🚀

Before closing, once more for old times' sake 🙃
Language Packs Logo

Thanks for this great conversation!


2 participants