Requirements #1

Open
7 of 14 tasks
joeflack4 opened this issue Jun 16, 2022 · 4 comments
@joeflack4
Member

joeflack4 commented Jun 16, 2022

Description

(Originally taken from: Requirements google doc)
Zulip Terminology Stream Text Mining Project

The base Zulip bulletin board application is supported by a REST API that can be queried via Python scripts. Bots can also be configured via Python to provide real-time monitoring. This project text-mines the terminology stream in the FHIR Zulip community bulletin board to discover trends in the use of terminologies and terminology services within the HL7 FHIR community.

The objective of this exercise is to review the history of the content and activity of the Terminology stream.
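
As a rough illustration of querying the REST API from Python, here is a minimal sketch that assumes the official `zulip` package and a local `zuliprc` credentials file. The stream name `terminology`, the helper names, and the config path are assumptions for illustration, not something this issue specifies.

```python
def build_stream_request(stream, num_before=100):
    """Build the parameters for fetching the newest messages in one stream."""
    return {
        "anchor": "newest",
        "num_before": num_before,
        "num_after": 0,
        "narrow": [{"operator": "stream", "operand": stream}],
    }


def fetch_stream_messages(stream, config_file="~/zuliprc"):
    """Fetch messages using the `zulip` package (pip install zulip)."""
    import zulip  # imported lazily so the request builder works without it

    client = zulip.Client(config_file=config_file)
    response = client.get_messages(build_stream_request(stream))
    if response["result"] != "success":
        raise RuntimeError(response.get("msg", "Zulip API request failed"))
    return response["messages"]


# Stream name is an assumption; adjust to the actual stream on chat.fhir.org.
request = build_stream_request("terminology")
```

Calling `fetch_stream_messages("terminology")` would need valid credentials. Fetching the full history would also need paging with an older `anchor`, which this sketch leaves out.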

Task list

Task details

(For more info, especially on tasks 1-5, refer to: Requirements google doc)

6a. Thread length

6a.i. Average length of threads: Determine the average length (in days / weeks / months) of threads in the terminology stream.
6a.ii. Identify outlier threads in terms of length: Identify outliers in length, i.e. longer-running threads.

Possible solutions:
For this, we can aggregate all thread lengths (i.e. in terms of number of messages) and report 2 different classes of outliers: (i) 1 standard deviation away from the mean, and (ii) 2 standard deviations away.
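
A minimal sketch of that two-class outlier report, assuming thread lengths are already available as a mapping of topic name to message count. The function name, the sample data, and the choice of population standard deviation are all assumptions. Only threads above the mean are flagged, since the task is about longer-running threads.

```python
from statistics import mean, pstdev


def classify_outliers(thread_lengths):
    """Group topics by how many standard deviations above the mean they are."""
    lengths = list(thread_lengths.values())
    mu = mean(lengths)
    sigma = pstdev(lengths)  # population standard deviation (an assumption)
    result = {"1_sd": [], "2_sd": []}
    if sigma == 0:
        return result  # all threads have the same length; no outliers
    for topic, n in thread_lengths.items():
        z = (n - mu) / sigma
        if z >= 2:
            result["2_sd"].append(topic)
        elif z >= 1:
            result["1_sd"].append(topic)
    return result


# Made-up message counts per topic, for illustration only.
thread_lengths = {
    "topic-a": 2, "topic-b": 2, "topic-c": 2, "topic-d": 2,
    "topic-e": 2, "topic-f": 2, "topic-g": 2, "topic-h": 2,
    "topic-i": 10, "topic-j": 14,
}
outliers = classify_outliers(thread_lengths)
```

With skewed data like thread lengths, a median-based threshold might behave better, but the standard-deviation version matches the proposal above.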

6b. Threads lacking adequate resolution

Identify topics with (i) many responses (not necessarily long-running threads, though they will likely be among those as well) that (ii) lack some sort of resolution. This will require iterative review with an SME (Davera or others).

Possible solutions:
(i) Many responses: Can potentially be defined as 1 standard deviation above the mean.
(ii) Lacking resolution: This would likely be too time-consuming to automate, so we should go with the suggestion of SME review. However, we could perhaps aid this analysis programmatically by re-reading the analytical output. The output (likely a CSV file) could have 1+ codified curator columns, where data will be manually entered by SMEs. That information could then be re-read if further programmatic analysis is needed.
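
The curator-column round trip could look roughly like the sketch below. The column names (`resolved`, `curator_notes`), the `yes`/`no` coding, and the sample topics are hypothetical; the real codes would need to be agreed with the SMEs.

```python
import csv
import io

# Hypothetical curator columns that SMEs fill in by hand.
CURATOR_COLUMNS = ["resolved", "curator_notes"]


def write_review_sheet(rows, out):
    """Write analysis rows with empty curator columns appended."""
    fieldnames = ["topic", "message_count"] + CURATOR_COLUMNS
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, **{col: "" for col in CURATOR_COLUMNS}})


def read_unresolved(inp):
    """Re-read the curated sheet and return topics marked as unresolved."""
    reader = csv.DictReader(inp)
    return [r["topic"] for r in reader if r["resolved"].strip().lower() == "no"]


buffer = io.StringIO()
write_review_sheet(
    [
        {"topic": "ValueSet expansion", "message_count": 42},
        {"topic": "LOINC versions", "message_count": 37},
    ],
    buffer,
)
sheet = buffer.getvalue()

# Simulate an SME filling in the curator columns in a spreadsheet.
reviewed = sheet.replace(
    "ValueSet expansion,42,,", "ValueSet expansion,42,no,needs follow-up"
).replace("LOINC versions,37,,", "LOINC versions,37,yes,")

unresolved = read_unresolved(io.StringIO(reviewed))
```

In practice the sheet would be written to a file and read back after review; `io.StringIO` just keeps the example self-contained.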

6c. Frequency variance

For each of the count categories (1-4) above, determine when these topics occur: when are they more frequent / less frequent?
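
One way to approach this, sketched under the assumption that each message has already been tagged with a category and carries a Unix `timestamp` in seconds (as Zulip message objects do, to my understanding). The `category` field, the helper names, and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def utc_ts(year, month, day):
    """Helper for building sample Unix timestamps (seconds, UTC)."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def monthly_counts_by_category(messages):
    """Count messages per (category, 'YYYY-MM') bucket."""
    counts = Counter()
    for msg in messages:
        when = datetime.fromtimestamp(msg["timestamp"], tz=timezone.utc)
        counts[(msg["category"], when.strftime("%Y-%m"))] += 1
    return counts


def peak_month(counts, category):
    """Return the month in which a category was mentioned most often."""
    per_month = {m: n for (c, m), n in counts.items() if c == category}
    return max(per_month, key=per_month.get)


# Made-up, pre-categorized messages for illustration.
messages = [
    {"category": "codesystem", "timestamp": utc_ts(2022, 1, 5)},
    {"category": "codesystem", "timestamp": utc_ts(2022, 1, 20)},
    {"category": "codesystem", "timestamp": utc_ts(2022, 2, 2)},
    {"category": "Operations", "timestamp": utc_ts(2022, 2, 10)},
    {"category": "Operations", "timestamp": utc_ts(2022, 2, 11)},
]
category_counts = monthly_counts_by_category(messages)
```

Swapping the `strftime` format (for example to `"%Y"`) changes the bucket size without touching the rest.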

6d. Activity variance

Date-based counts for all topics indicating activity levels: when is the stream more active / less active?
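
A small sketch of stream-wide activity counts, assuming a flat list of message timestamps in Unix seconds. The function names and the sample dates are made up. Note that months with no messages do not appear in the result at all, so the "least active" month here is the least active among months with any traffic.

```python
from collections import Counter
from datetime import datetime, timezone


def to_ts(year, month, day):
    """Helper for building sample Unix timestamps (seconds, UTC)."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def activity_by_month(timestamps):
    """Count messages per 'YYYY-MM' across the whole stream."""
    return Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m")
        for t in timestamps
    )


# Made-up message times for illustration.
timestamps = [
    to_ts(2022, 1, 3), to_ts(2022, 1, 20), to_ts(2022, 1, 28),
    to_ts(2022, 2, 10),
    to_ts(2022, 3, 1), to_ts(2022, 3, 15),
]
activity = activity_by_month(timestamps)
most_active = max(activity, key=activity.get)
least_active = min(activity, key=activity.get)
```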

Additional info

Links

  1. Requirements google doc
  2. Chat URL: http://chat.fhir.org
  3. Zulip API docs: https://zulip.com/api/rest
  4. Category keywords google sheet
@joeflack4
Member Author

joeflack4 commented Jun 16, 2022

@DaveraGabriel FYI
@stephanieshong I don't remember who else might be working on this, but feel free to link them to this issue or add them to the assignees.

@stephanieshong

We will assign this task to Rohan Hurer.

@stephanieshong

stephanieshong commented Jun 16, 2022

# Example of an NLP keyword search that might be useful.
# Install dependencies first: pip3 install --user nltk flashtext

import nltk
from flashtext import KeywordProcessor

# Download the tokenizer data (not needed by flashtext itself).
nltk.download('punkt')

keyword_processor = KeywordProcessor()
# Maps each category name to the keywords that should count toward it.
keyword_dict = {
    "codesystem": ["DICOM", "SNOMED", "LOINC", "ICD10CM", "ICD10PCS", "NDC", "RxNorm"],
    "HL7Productfamilies": ["CDA", "C-CDA", "V3", "Version3"],
    "TerminologyResources": ["ConceptMap", "CodeSystem", "ValueSet", "Terminology Service", "TerminologyCapabilities", "NamingSystem", "Coding", "Code", "CodeableConcept"],
    "Operations": ["$lookup", "$validate-code", "$subsumes", "$find-matches", "$expand", "$translate", "$closure"]
}

keyword_processor.add_keywords_from_dict(keyword_dict)
# extract_keywords returns the category name for each keyword found in the text.
print(keyword_processor.extract_keywords('zulip activities based on code system, HL7Product family, Terminology Resources and Operations'))

@joeflack4
Member Author

joeflack4 commented Jun 21, 2022

Some options we discussed:
a. Fetch the message text of each stream topic, query each topic separately, then aggregate the results.
b. Concatenate the text of all topics into one big string, then query that.

My instinct leans towards (a) for some reason, but I think both are potentially good.
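
To make the trade-off concrete, here is a small sketch of both options. It uses plain case-insensitive substring counting as a stand-in for the flashtext search above, and the topic names, messages, and trimmed keyword dict are made up for illustration.

```python
from collections import Counter

# Trimmed-down keyword dict, for illustration only.
keywords = {
    "codesystem": ["SNOMED", "LOINC"],
    "TerminologyResources": ["ValueSet", "ConceptMap"],
}


def count_categories(text):
    """Count keyword hits per category (case-insensitive substring match)."""
    lowered = text.lower()
    counts = Counter()
    for category, words in keywords.items():
        hits = sum(lowered.count(word.lower()) for word in words)
        if hits:
            counts[category] = hits
    return counts


# Made-up topics mapped to their message texts.
topics = {
    "ValueSet expansion": ["How do I $expand a ValueSet?", "Use SNOMED codes"],
    "LOINC mapping": ["ConceptMap from LOINC to SNOMED"],
}

# Option (a): query each topic separately, then aggregate.
per_topic = {name: count_categories("\n".join(msgs)) for name, msgs in topics.items()}
aggregated = sum(per_topic.values(), Counter())

# Option (b): concatenate everything and query once.
all_text = "\n".join("\n".join(msgs) for msgs in topics.values())
combined = count_categories(all_text)
```

Both routes give the same totals here, but (a) also keeps the per-topic breakdown, which the thread-level tasks (6a, 6b) would need anyway. With (b), the separator matters: joining without one could create matches that span two messages.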
