Skip to content
This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Long conjunctive query phrases make TCAT not start #334

Open
xmacex opened this issue Sep 30, 2018 · 10 comments · Fixed by #335
Open

Long conjunctive query phrases make TCAT not start #334

xmacex opened this issue Sep 30, 2018 · 10 comments · Fixed by #335
Labels

Comments

@xmacex
Copy link
Contributor

xmacex commented Sep 30, 2018

Hi. The following, long conjunctive query phrase #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem will make TCAT not restart after the routine reload which is performed shortly after a query bin is added, as observed in logs/controller.php

2018-09-30 19:38:02 enforcing reload of config for track
2018-09-30 19:38:02 sending a TERM signal to track for 10060
2018-09-30 19:38:02 waiting for graceful exit of script track with pid 10060

I am waiting for the process dmitcat_track.php to respawn via the normal cronjob by monitoring its appearance in ps afx |grep 'var/www/dmi-tcat' |grep "track" listing. It won't be there. Removing the bin from the usual web UI brings the dmitcat_track.php back, after the routine reload is done (in a dozen or so seconds).

screen shot 2018-09-30 at 19 52 08

The query phrase is long, 142 characters. If I arbitrarily split it into two by placing a disjuctive comma in the middle #Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem the same issue still persists.

However, if I further split these into two disjunctions each, #Chinabigbrother #Citizenrating, #Chinasurveillance #Socialcredit, #Chinablackmirror #Digitaldictatorship, #Digitalcensorship #Socialcreditsystem, things work as usual. Of course any actual query design is out the window by this point (if it ever was there, but that's an another matter).

I observed the unabridged query phrase go into the database. The bin gets it's entries in tcat_query_bins_* tables, as usual.

I am not particularly expert in PHP / SQL webapp debugging, so if someone can please guide me how to get more information about the program flow, where to request debug messages, throw PHP exceptions and stack traces, or maybe even run the dmitcat_track.php under XDebug or another debugging tool, I would appreciate it.

I hope I will be able the send a pull request once I have a better idea where to get more information what goes wrong.

@xmacex
Copy link
Contributor Author

xmacex commented Sep 30, 2018

Oops sorry I was neglecting the logs/track.error.log, silly me. With this TCAT query bin in effect, we get a HTTP/1.1 406 Not Acceptable response, and the dmitcat_track.php terminates silently terminates. I won't be able to paste in the whole thing from the log file for security reasons, sorry. I'm afraid there is more TCAT source code reading for me to do...

@xmacex
Copy link
Contributor Author

xmacex commented Sep 30, 2018

Twitter Standard streaming API request parameters says about track

Each phrase must be between 1 and 60 bytes, inclusive.

Does it make any sense to document this issue like so, without writing this test against TCAT's codebase

<?php
/**
 * See https://github.com/digitalmethodsinitiative/dmi-tcat/issues/334 for the issue
 */
use PHPUnit\Framework\TestCase;

final class QueryPhraseTest extends TestCase
{
    /** 
     * Test that any phrase is within the Twitter API spec
     * https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/basic-stream-parameters.html
     *
     * @dataProvider phraseProvider
     */
    public function testPhraseLengthIsWithinTwitterApiTrackSpecs($phrase)
    {
        $minLength = 0;
        $maxLength = 60;
        $this->assertGreaterThanOrEqual($minLength, strlen($phrase));
        $this->assertLessThanOrEqual($maxLength, strlen($phrase));
    }

    /**
     * Data providers
     */
    public function phraseProvider()
    {
        return [
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem"],
            ["#Chinabigbrother #Citizenrating",
             "#Chinasurveillance #Socialcredit",
             "#Chinablackmirror #Digitaldictatorship",
             "#Digitalcensorship #Socialcreditsystem"],
        ];
    }
}

Which currently gives us

PHPUnit 7.3.5 by Sebastian Bergmann and contributors.

FF. 3 / 3 (100%)

Time: 122 ms, Memory: 10.00MB

There were 2 failures:

  1. QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set #0 ('#Chinabigbrother #Citizenrati...system')
    Failed asserting that 142 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

  1. QueryPhraseTest::testPhraseLengthIsWithinTwitterApiTrackSpecs with data set URL expander exits with error #1 ('#Chinabigbrother #Citizenrati...credit', '#Chinablackmirror #Digitaldic...system')
    Failed asserting that 64 is equal to 60 or is less than 60.

/Users/maco/tip/QueryPhraseTest.php:20

FAILURES!
Tests: 3, Assertions: 12, Failures: 2.

Anyway, towards a solution I would be inclined to fortify the web UI from

dmi-tcat/capture/index.php

Lines 488 to 493 in 85504de

if(params['type']=='track') {
var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length);
if(!_nrOfPhrases) {
alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases.");
return false;
}

to something like

                if(params['type']=='track') {
                    var _nrOfPhrases = validateNumberOfPhrases(params['oldphrases'].split(",").length,_newphrases.split(",").length);
                    if(!_nrOfPhrases) {
                        alert("With this query you will exceed the number of allowed queries (400) to the Twitter API. Please reduce the number of phrases.");
                        return false;
                    }
                    var _lenOfPhrases = validateLengthOfPhrases(_newphrases);
                    if(!_lenOfPhrases) {
                        alert("One or more of the phrases exceeds the allowed length (60 characters) to the Twitter API. Please shorten your phrases.");
                        return false;

with validateLengthOfPhrases(phrases) being something like phrases.split(',').every(phrase => phrase.length <= 60). Some fortification further down the stack would be good too, I imagine, before phrases are accepted into the database.

@xmacex
Copy link
Contributor Author

xmacex commented Sep 30, 2018

Or alternatively fortify the web UI in here

function validateQuery(query,type) {

@ErikBorra
Copy link
Member

ErikBorra commented Oct 1, 2018

Hi @xmacex,

thanks for submitting this issue.

As you stated, the phrase does not make any sense as it would only track tweets which contain exactly "#Chinabigbrother #Citizenrating #Chinasurveillance #Socialcredit #Chinablackmirror #Digitaldictatorship #Digitalcensorship #Socialcreditsystem".

TCAT should indeed validate the length of phrases, although it should not insert arbitrary comma's thus modifying the phrases input by the user. Instead, the user could get a warning indicating that the phrase to track is too long.

Best,

Erik

@xmacex
Copy link
Contributor Author

xmacex commented Oct 1, 2018

@ErikBorra I suggest let's keep the query design (Rogers 2017; and of course a whole body of literature in information retrieval) issues separate from input string validation. Of course as STS scholars this is a moment of reflection about how this divide exactly gets constantly made and re-made in the everyday practices 😆

@xmacex
Copy link
Contributor Author

xmacex commented Oct 1, 2018

Anyway I'd be happy to work on a pull request. I imagine input validation at JS as suggested in #334 (comment), but if someone can nudge me towards what would be a good place to introduce some fortification further down the stack, in I'll do that too while I'm on it. The function create_new_bin

function create_new_bin($params) {
looks like a good candidate

@xmacex
Copy link
Contributor Author

xmacex commented Oct 2, 2018

Working on it here https://github.com/xmacex/dmi-tcat/tree/phrases_must_not_exceed_60chrs_each. I really don't know how to provide tests for Javascript functions buried in PHP 😟

@xmacex
Copy link
Contributor Author

xmacex commented Oct 3, 2018

A terminological observation about myself: I seem to have converged towars speaking of kitten,lizard,'large aardvark',anteater as one query containing four phrases.

@dentoir
Copy link
Contributor

dentoir commented Nov 12, 2018

Hi @xmacex

Sorry for the late reply on this and thank you for your pull request. It looks sound in principle. Some quick remarks before I can merge this. Maybe we should add a maximum length to the @username in the follow role as well (which appears to be 15 characters). There is a third location where (individual) phrases are validated, which is validate_capture_phrases() in common/functions.php and is called by CLI scripts such as search.php. Finally, could you detach your unit test from the pull request? It would leave a script in the tree which has non-present dependencies.

@xmacex
Copy link
Contributor Author

xmacex commented Nov 13, 2018

Thanks @dentoir. The Twitter API documentation for follow does not seem to state that the lengths of usernames are limited as API query time (of course on new user registration, yes). Exceeding the 60 characters limit for a track, the Twitter API responds with a HTTP/1.1 406 Not Acceptable. Does it give an error in the HTTP response, and freak out TCAT if asked to follow a user name too long? I am not sure if I can currently verify that myself.

I implemented validate_capture_phrases() hopefully as suggested in commit df582ee, sorry I missed that. Thanks for pointing it out. If someone can take a quick review of the PR #335, I'll detach the test suite and nudge the pull request.

@niczem niczem added the bug label Jan 2, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants