FabianUlrich96/Data-retrieval

Data-retrieval

Data used in the study: Stack Exchange Data Dump 01.06.2020: https://archive.org/download/stackexchange

Files and their use

| File | Description |
| --- | --- |
| parse_xml.py | Function file to parse the XML files from the Data Dump to .csv files |
| related_tags.py | Function file to count the tags related to one given search term |
| related_tag_count.py | Function file to merge the tag counts of two files generated with related_tags.py |
| total_tag_count.py | Function file to generate a .csv file with the total count of tags in a dataset |
| threshold_calculation.py | Function file to calculate the TRT<sup>1</sup> and TST<sup>2</sup> thresholds |
| UserSelection.py | Class file with general functions for the user input |
| CsvAction.py | Class file with general functions to edit .csv files |

## Thresholds

The thresholds are taken from Rosen and Shihab's study (2016) of mobile development on Stack Overflow. To select the tags relevant to their study, they used two thresholds:

> TRT<sup>1</sup> = Number of mobile posts / Total number of posts
>
> - Number of mobile posts = the number of posts that contain at least one keyword from the initial set.
> - Total number of posts = the total number of posts carrying the tag in question.

> TST<sup>2</sup> = Number of mobile posts / Number of mobile posts for the most popular tag
>
> - Number of mobile posts = the number of posts that contain at least one keyword from the initial set.
> - Number of mobile posts for the most popular tag = the highest mobile post count of any tag, computed as `popular_number_count = mobile_number['Number_count'].max()`
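As a rough illustration of the two definitions, the sketch below computes TRT and TST from two tag-count mappings. The function name `compute_thresholds` and the sample counts are hypothetical and are not taken from threshold_calculation.py.

```python
def compute_thresholds(mobile_counts, total_counts):
    """Return {tag: (trt, tst)} for every tag in mobile_counts.

    mobile_counts: tag -> number of mobile posts carrying the tag
    total_counts:  tag -> number of all posts carrying the tag
    """
    # Denominator of TST: mobile post count of the most popular tag.
    most_popular = max(mobile_counts.values())
    result = {}
    for tag, mobile in mobile_counts.items():
        trt = mobile / total_counts[tag]
        tst = mobile / most_popular
        result[tag] = (trt, tst)
    return result


# Hypothetical counts, for illustration only.
mobile_counts = {"listview": 80, "sqlite": 40, "gradle": 20}
total_counts = {"listview": 100, "sqlite": 100, "gradle": 25}

thresholds = compute_thresholds(mobile_counts, total_counts)
for tag, (trt, tst) in thresholds.items():
    print(f"{tag}: TRT={trt:.2f} TST={tst:.2f}")
```

With these made-up counts, the most popular tag always has a TST of 1, and every other tag is expressed as a fraction of it.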

Working steps

Parsing

posts.xml and tags.xml were parsed into .csv files with parse_xml.py.
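The parsing step could look roughly like the sketch below, which streams `<row .../>` elements and writes selected attributes to CSV. It is not the code in parse_xml.py: the function name `xml_rows_to_csv` is hypothetical, and the sketch assumes the data dump stores each record as a `row` element with attributes such as `Id`, `PostTypeId` and `Tags`.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_source, csv_target, columns):
    """Stream <row .../> elements and write the chosen attributes as CSV rows."""
    writer = csv.writer(csv_target)
    writer.writerow(columns)
    written = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Missing attributes become empty cells.
            writer.writerow([elem.get(col, "") for col in columns])
            written += 1
            elem.clear()  # drop the element's attributes to keep memory use low
    return written


# Tiny in-memory sample in the assumed shape of posts.xml (values are made up).
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<posts>\n"
    b'  <row Id="1" PostTypeId="1" Tags="&lt;android&gt;&lt;kotlin&gt;" />\n'
    b'  <row Id="2" PostTypeId="2" />\n'
    b"</posts>\n"
)
out = io.StringIO()
count = xml_rows_to_csv(sample, out, ["Id", "PostTypeId", "Tags"])
print(count)
print(out.getvalue())
```

Streaming with `iterparse` avoids loading the whole file at once, which matters because the posts file of the full dump is very large.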

Initial list of keywords<sup>3</sup>

| Category | Keyword |
| --- | --- |
| Development | android-studio |
| Development | android-sdk |
| Development | android-app |
| Development | android |
| Languages | kotlin |

<sup>3</sup> The initial list of keywords is the same for the English and the Russian Stack Overflow communities, because searching for the Cyrillic spellings of google and android (гоогле and андроид) returns no results.

Generating the initial list of keywords

The list was generated by first searching for tags whose names contain android, google or kotlin (165 tags in total). The tags were then categorized manually.
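The substring search described above can be sketched as follows. The helper `find_candidate_tags` and the sample tag names are hypothetical; the real input would be the tag names from the parsed tags file.

```python
def find_candidate_tags(tag_names, substrings=("android", "google", "kotlin")):
    """Return the tags whose name contains at least one of the substrings."""
    return [
        name for name in tag_names
        if any(sub in name.lower() for sub in substrings)
    ]


# Made-up sample of tag names, for illustration only.
sample_tags = ["android-studio", "java", "google-maps", "kotlin-native", "sqlite"]
candidates = find_candidate_tags(sample_tags)
print(candidates)
```

The manual categorization that follows is not automated here; the sketch only covers the initial candidate search.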

| Category | Number of tags |
| --- | --- |
| Layout/XML | 25 |
| Android c | 2 |
| Android general | 85 |
| Google mobile | 6 |
| Google other | 44 |
| kotlin | 3 |

  1. The category "Layout/XML" can be filtered out completely, because the languages researched in this study are Kotlin and Java.
  2. The category "Android c" can be filtered out completely, because the languages researched in this study are Kotlin and Java.
  3. The category "Google other" can be filtered out completely, because the scope of this study is mobile development.
  4. The category "Google mobile" can be filtered out completely, because the posts for these tags tend to be about the services themselves rather than about specific programming issues.
  5. The category "kotlin" can be limited to the single tag "kotlin": posts tagged "kotlin-faq" are also tagged "kotlin", and posts tagged "kotlin-native" are tagged "c" (the languages researched in this study are Kotlin and Java).
  6. The category "Android general" can be limited to 4 tags that are essential for every Android project. Most tags containing the substring android are specific to a topic or element (e.g. android-camera or android-toast).

Disclaimer: A hardware category, as suggested in the study, is not necessary, because posts with the tags "galaxy" and "nexus" are scarce, and the greater variety of devices makes it uncommon for Stack Overflow users to tag their hardware device. A category for Android versions is not necessary either, because the only available tags are "android-5.0-lollipop", "pie" and "android-tv" (two of the three "android-tv" posts are also labeled with "android-sdk" or "android").

Getting the count of the mobile tags

related_tags.py was run once for each keyword in the initial keyword table. The resulting count files were then merged two at a time with related_tag_count.py.
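A minimal sketch of this step, assuming that the tags of a post are stored as a string of the form `<android><kotlin>` and that the keyword itself is left out of its own related-tag counts. The function names are hypothetical and do not mirror related_tags.py or related_tag_count.py.

```python
from collections import Counter


def split_tags(tag_string):
    """Turn '<android><kotlin>' into ['android', 'kotlin']."""
    return [t for t in tag_string.strip("<>").split("><") if t]


def related_tag_counts(tag_strings, keyword):
    """Count the tags that co-occur with keyword, one count per post."""
    counts = Counter()
    for tag_string in tag_strings:
        tags = split_tags(tag_string)
        if keyword in tags:
            counts.update(t for t in tags if t != keyword)
    return counts


def merge_counts(first, second):
    """Add two count tables tag by tag."""
    return first + second


# Made-up Tags values, for illustration only.
posts = ["<android><listview>", "<android><kotlin><sqlite>", "<kotlin><listview>", "<java>"]
android = related_tag_counts(posts, "android")
kotlin = related_tag_counts(posts, "kotlin")
merged = merge_counts(android, kotlin)
print(merged)
```

In this sample, the one post tagged with both keywords contributes to the merged count of sqlite twice, so merged counts can exceed the number of distinct posts.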

Getting the total count of tags

To get the total count of tags, total_tag_count.py was run on the previously parsed posts.csv file.
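Counting all tags in the parsed file might be sketched like this. The column name `Tags`, the `<a><b>` tag format and the function name `total_tag_counts` are assumptions made for the example, not details confirmed by total_tag_count.py.

```python
import csv
import io
from collections import Counter


def total_tag_counts(csv_file, tag_column="Tags"):
    """Count how often each tag occurs in the given column of a posts CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        tag_string = row.get(tag_column) or ""
        counts.update(t for t in tag_string.strip("<>").split("><") if t)
    return counts


# In-memory stand-in for posts.csv; the rows are made up.
sample_csv = io.StringIO(
    "Id,Tags\n"
    "1,<android><listview>\n"
    "2,<java><listview>\n"
    "3,\n"
)
totals = total_tag_counts(sample_csv)
print(totals.most_common())
```

Rows without tags, such as answers, are skipped because their tag cell is empty.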

Getting the TRT and TST values

For the thresholds, threshold_calculation.py was run on the previously generated mobile tag count and total tag count files. TRT values reach up to 2 because a tag can be used together with several of the initial keywords. Because the tags with a TRT value above 1 have a low post volume, they are filtered out here, and only tags with 0.4 < TRT < 1 are used.

| Tag | TRT |
| --- | --- |
| listview | ~0.8845 |
| sqlite | ~0.4391 |
| gradle | ~0.918 |
| firebase | ~0.7346 |
| google-play | ~0.8925 |
| webview | ~0.8519 |

As in Rosen and Shihab's study (2016), the threshold for the TST is set to 1%.

| Tag | TST |
| --- | --- |
| listview | 0.02956274802490261 |
| sqlite | 0.025084647030982638 |
| gradle | 0.01671096224560381 |
| firebase | 0.016019223067681214 |
| google-play | 0.012087231951068554 |
| webview | 0.01004842174245458 |
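The two filters can be sketched as a single check. The values below are rounded from the figures listed above; the extra entry `hypothetical-tag` is invented to show a rejected tag, and treating the 1% TST threshold as "at least 1%" is an assumption.

```python
TRT_MIN, TRT_MAX = 0.4, 1.0  # keep tags with 0.4 < TRT < 1
TST_MIN = 0.01               # keep tags with a TST of at least 1% (assumed inclusive)


def passes_thresholds(trt, tst):
    """Return True if a tag satisfies both the TRT range and the TST minimum."""
    return TRT_MIN < trt < TRT_MAX and tst >= TST_MIN


# tag -> (TRT, TST), rounded from the values listed above.
values = {
    "listview": (0.8845, 0.02956),
    "sqlite": (0.4391, 0.02508),
    "gradle": (0.918, 0.01671),
    "firebase": (0.7346, 0.01602),
    "google-play": (0.8925, 0.01209),
    "webview": (0.8519, 0.01005),
    "hypothetical-tag": (1.5, 0.002),  # invented entry that fails both checks
}

selected = [tag for tag, (trt, tst) in values.items() if passes_thresholds(trt, tst)]
print(selected)
```

All six listed tags pass, while the invented entry is dropped for having a TRT above 1 and a TST below 1%.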
