SEO: resolve potential duplicate content #19

Open
marekcierny opened this issue Jan 4, 2016 · 17 comments

@marekcierny
Contributor

Several examples of potential duplicate content exist:

  1. switching between language versions: https://anatom.cz/en/ - https://practiceanatomy.com/
  2. user registration: anatom.cz/view/LE/?sessionid= - anatom.cz/view/LE/
  3. view of a particular image: https://anatom.cz/view/04/?context=svaly-krkusvg - https://anatom.cz/view/04/

Duplicate content should be a) avoided if possible, b) resolved by a 301 redirect, or c) resolved by <link rel="canonical"> (https://support.google.com/webmasters/answer/139066).

@slaweet
Member

slaweet commented Jan 5, 2016

  1. has been resolved by a 301 redirect for some time, but Google still hasn't reindexed it. More recently I tried disallowing the /en/ and /cs/ URLs in robots.txt.
  2. has been resolved for some time as well; it's just still hanging around in the Google index.
  3. is a TODO.

@marekcierny
Contributor Author

  1. I would argue against disallowing the /en/ and /cs/ URLs in robots.txt, as any link to a URL that robots cannot access leads to a loss of page rank (and it might prevent the crawler from seeing the redirect).

Ultimately, I think c) <link rel="canonical"> should be added to every page to resolve any potential duplicate content we might miss (e.g. tracking campaigns and traffic sources).

@marekcierny
Contributor Author

I wrote a simple PHP function that rewrites any URL into a "canonical URL".
canonical.TXT

If "echo get_canonical_meta($url)" can be added to every page, it can help us explain our duplicate content to search engines.

@papousek
Member

papousek commented Jan 9, 2016

Unfortunately, the application is written in Python, so we cannot include your script in every page view directly. On the other hand, I assume we can rewrite it in Python (@slaweet?)

slaweet added a commit that referenced this issue Jan 10, 2016
@slaweet
Member

slaweet commented Jan 10, 2016

I added canonical URLs (f9a7454).
I'm just stripping the query string (everything after ?). Because of that, I changed /overview/?tab=location to /overview/tab/location.
I didn't implement the part that changes the domain in the canonical URL, because it would never be executed: the 301 redirect runs first, and by then we are already on the correct domain.
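
For readers following along, stripping the query string can be sketched in a few lines of Python. This is only an illustration of the approach described above, not the code from f9a7454; the function names `canonical_url` and `canonical_link_tag` are made up for this example.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host and path; drop everything after "?" and "#".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Render a <link rel="canonical"> element for the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s">' % href


print(canonical_link_tag("https://anatom.cz/view/04/?context=svaly-krkusvg"))
```

Because the whole query string is dropped, any page state that should be indexed separately has to live in the path, which is why the tab moved to /overview/tab/location.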

@slaweet
Member

slaweet commented Jan 10, 2016

As for disallowing /en/ and /cs/, I removed it from robots.txt, but I don't see why it should influence the page rank of any page other than the ones under /en/ and /cs/, which we don't want in search results anyway. And IMO we don't want Google to see the redirect, but rather to reach the alternative language version directly through <link rel="alternate" ...

@marekcierny
Contributor Author

OK.
As for disallowing /en/ and /cs/ in robots.txt: http://webmasters.stackexchange.com/questions/54240/is-it-safe-to-block-redirected-but-still-linked-urls-with-robots-txt (In general, my understanding is that disallowing robots from any URL we link to within our site is not good.)
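
The effect of such a rule can be checked with Python's standard `urllib.robotparser`. The rules below are an assumption about what the robots.txt looked like (the actual file is not quoted in this thread), and `crawler_may_fetch` is a helper name invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules; the real robots.txt is not shown in this thread.
ASSUMED_ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /en/",
    "Disallow: /cs/",
]

parser = RobotFileParser()
parser.parse(ASSUMED_ROBOTS_TXT)


def crawler_may_fetch(url: str) -> bool:
    """Tell whether a rule-abiding crawler is allowed to request the URL."""
    return parser.can_fetch("*", url)


# A crawler that may not request /en/ never receives the 301 response.
print(crawler_may_fetch("https://anatom.cz/en/"))
print(crawler_may_fetch("https://anatom.cz/practice/"))
```

A crawler that is not allowed to request a URL never gets the response, so it cannot learn that the URL answers with a 301.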

The canonical form of the URL is also relevant to <link rel="alternate" ...> and the sitemap: only canonical forms of URLs should be linked as another language version.
For example, on https://anatom.cz/practice//, the canonical URL is https://anatom.cz/practice/, and the alternate language URLs should also end with only one /.

slaweet added a commit that referenced this issue Jan 10, 2016
@slaweet
Member

slaweet commented Jan 10, 2016

I've updated <link rel="alternate" ...> (1d33303), even though I don't think it matters what is on the non-canonical pages, as Google is only going to look at (index) the canonical ones.
I've also added a '//' -> '/' replacement to the canonical URL.
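
A rough sketch of the slash replacement combined with the alternate links, for illustration only (it is not the code from 1d33303). The `LANGUAGE_DOMAINS` mapping is a guess based on the two domains mentioned in this issue, and both function names are hypothetical.

```python
import re
from html import escape
from urllib.parse import urlsplit

# Assumed language-to-domain mapping, based on the domains in this thread.
LANGUAGE_DOMAINS = {"cs": "anatom.cz", "en": "practiceanatomy.com"}


def canonical_path(url: str) -> str:
    """Return the path without query string and with runs of '/' collapsed."""
    path = urlsplit(url).path
    return re.sub(r"/{2,}", "/", path) or "/"


def alternate_link_tags(url: str) -> list:
    """Build one <link rel="alternate"> per language from the canonical path."""
    path = escape(canonical_path(url), quote=True)
    return [
        '<link rel="alternate" hreflang="%s" href="https://%s%s">' % (lang, domain, path)
        for lang, domain in sorted(LANGUAGE_DOMAINS.items())
    ]


for tag in alternate_link_tags("https://anatom.cz/practice//"):
    print(tag)
```

Building the alternates from the already canonicalized path keeps the canonical and alternate links consistent with each other on every page.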

@marekcierny
Contributor Author

Thank you, Víťo.
Do you use www.google.com/webmasters/tools/ to check for SEO warnings/errors? (I think it's a great tool, especially as we want to add more languages and content in the future.)
I've just noticed that when logged in, view-source:https://anatom.cz/ shows the canonical address "https://anatom.cz/overview/". When logged out, it's correct.

@marekcierny marekcierny added the FI label Jan 13, 2016
@marekcierny
Contributor Author

I might be too picky, but other potential duplicate content is:

  4. URLs with and without "/" at the end (e.g. https://anatom.cz/practice/A [chapter selected with a tick] and https://anatom.cz/practice/A/ [chapter selected by clicking an arrow])
  5. selection of chapters for practice (e.g. https://anatom.cz/practice/09/LE and https://anatom.cz/practice/LE/09 [the second URL is accessible from anatom.cz/view/LE/ - "vybrat podkapitolu" (select a subchapter)])

@slaweet
Copy link
Member

slaweet commented Jan 13, 2016

view-source:https://anatom.cz/ for logged in users actually redirects to view-source:https://anatom.cz/overview (notice address bar). Hopefully, search engines cannot log in :-)

I use www.google.com/webmasters/tools/ every now and then, and I haven't noticed any SEO warnings or errors there. I've linked Webmaster Tools with GA, so it probably displays the errors in GA as well.

Re 4 and 5: I see the problem; I'll have to think about how to solve it technically.
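
One possible direction, purely as a sketch: normalize the path before using it as the canonical URL (or as a 301 target). It assumes that the order of the selected chapters in /practice/ URLs carries no meaning and that the form with a trailing slash is the preferred one; neither has been confirmed in this thread, and `normalize_practice_path` is a hypothetical name.

```python
def normalize_practice_path(path: str) -> str:
    """Map equivalent paths to a single form.

    Assumptions (not confirmed by the application): every path ends with
    exactly one '/', and the segments after /practice/ are an unordered
    selection, so they can be sorted.
    """
    segments = [segment for segment in path.split("/") if segment]
    if segments and segments[0] == "practice":
        # Sort the selected chapters so both orders give the same path.
        segments = ["practice"] + sorted(segments[1:])
    if not segments:
        return "/"
    return "/" + "/".join(segments) + "/"


print(normalize_practice_path("/practice/A"))
print(normalize_practice_path("/practice/LE/09"))
```

With this, /practice/09/LE and /practice/LE/09 both map to /practice/09/LE/, and /practice/A maps to /practice/A/.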

@marekcierny
Contributor Author

Although there is no link to such a page, I'm not sure whether this could be a problem for search engines or for users/brand/security:
https://anatom.cz/overview/V%C3%ADt%C3%A1%20v%C3%A1s%20blbe%C4%8Dek
https://anatom.cz/view/02/V%C3%ADt%C3%A1%20v%C3%A1s%20blbe%C4%8Dek
(a random URL parameter is accepted as canonical, and the random text is displayed in the heading)

@slaweet
Member

slaweet commented Jan 15, 2016

Re #19 (comment):
Good catch.
That URL is actually a link for viewing the knowledge of a user, e.g.
https://anatom.cz/overview/slaweet
https://anatom.cz/overview/cierny.m

The problem is that we don't check whether the given string is a valid username. If it isn't, the page should return an error.
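
The missing check could look roughly like this. It is a framework-agnostic sketch: `KNOWN_USERNAMES` stands in for a lookup in the real user database, and `overview_response` is a made-up name, not a function from the application.

```python
from urllib.parse import unquote

# Hypothetical stand-in for a lookup in the real user database.
KNOWN_USERNAMES = {"slaweet", "cierny.m"}


def overview_response(path_segment: str) -> tuple:
    """Return (status_code, heading) for /overview/<path_segment>."""
    username = unquote(path_segment)
    if username not in KNOWN_USERNAMES:
        # Unknown name: answer 404 and do not echo the raw text in the heading.
        return 404, "User not found"
    return 200, "Knowledge overview: " + username


print(overview_response("slaweet"))
print(overview_response("no%20such%20user"))
```

Returning 404 for unknown names also keeps arbitrary text from a URL out of the page heading and out of the canonical link.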

@marekcierny
Contributor Author

Víťo, when I suggested making a separate URL for /overview/?tab=location so that the crawler would see our main content tree, I didn't know that Google can understand AJAX.
Now I think it wasn't a good idea from the start, and we might be better off without it. I'm sorry for making it complicated.

@slaweet
Copy link
Member

slaweet commented Feb 3, 2016

Marku, I don't think the Google AJAX crawling scheme is applicable here. Anything we want to appear in search results (like /overview/?tab=location) has to be on a separate URL.

@slaweet
Copy link
Member

slaweet commented Feb 3, 2016

And FYI, your example with "Vítá vás blbeček" has been indexed by Google, because Google crawled our GitHub :-)
FYI no. 2: the problem with SEO in GA was just a reporting issue caused by the http -> https migration in December. Our impressions moved to the https version of anatom.cz, and those were not listed.

@marekcierny
Contributor Author

First, I am concerned that we have very similar (and sometimes identical) content when a user views an image under different chapters/body parts (e.g. practiceanatomy.com/view/UE/image/casti-lidskeho-telasvg and practiceanatomy.com/view/LE/image/casti-lidskeho-telasvg). Can we change the URL to practiceanatomy.com/view/LE/#image/casti-lidskeho-telasvg or practiceanatomy.com/view/LE/#image/5?

Second, I've found a simple SEO guide, and there are several things we do not do yet:
